ABSTRACT

A major challenge facing computer architects today is designing cost-effective hardware
that executes multiple operations simultaneously. The goal of such designs is to improve
performance by taking advantage of fine-grain parallelism. In this dissertation, I
study  vector  architectures,  the  oldest  of  several  processor  designs  that  support  fine-grain 
parallelism.  Because  implementing  a  cost-effective  processor  that  performs  well  requires 
studying  not  only  the  design  of  processors  but  also  the  design  of  algorithms  for  compilers, 
this  dissertation  encompasses  aspects  of  both  hardware  and  software  design. 

In  the  first  half  of  this  dissertation,  I  demonstrate  that  a  vector  architecture  is  a 
cost-effective  processor  that  supports  fine-grain  parallelism.  I  show  that  implementing  a 
vector  architecture  is  no  more  costly  than  implementing  a  superscalar  architecture,  which 
is  currently  popular  among  designers  of  VLSI  microprocessors.  I  then  show  that  programs 
that  are  rich  in  parallelism  tend  also  to  be  vectorizable  and  are  also  the  ones  that  execute  the 
longest  in  a  workload,  thus  demonstrating  further  the  effectiveness  of  vector  architectures. 
Finally,  I  show  that  superpipelined  hardware  in  combination  with  a  vector  architecture  can 
take advantage of what little parallelism is available in non-vectorizable programs.

In  the  second  half  of  this  dissertation,  I  investigate  the  cost  and  performance  of 
different organizations for a vector register file in the Cray Y-MP vector processor, an
investigation that emphasizes the interaction between processor design and compiler algorithms.
After showing that instruction scheduling has a major impact on how effectively more vector
registers can be used, I present data from simulation experiments indicating that 16 vector
registers and a list scheduling algorithm can improve performance significantly over that
of 8 vector registers and the scheduling algorithm used in the Cray vectorizing compiler. I
also investigate the usage of an alternative register organization, called a partitioned vector
register file, which is less costly to implement than a traditional one but places some
restrictions  on  accessing  vector  registers.  To  circumvent  this  restrictive  access,  I  develop 
an  algorithm  for  assigning  vector  registers  and  present  data  showing  that,  when  using  my 
algorithm,  the  performance  of  a  partitioned  vector  register  file  is  comparable  to  that  of  a 
traditional  one. 
Chapter 1

Introduction 


In the fall of 1990, I worked at Cray Research, Incorporated with the architecture
group  that  is  designing  the  follow-on  to  the  Cray  Y-MP  C-90,  which  in  turn  is  the  successor 
to  the  Cray  Y-MP,  the  classic  embodiment  of  a  vector  architecture.  My  project  was  to 
answer  the  following  question: 

How  many  vector  registers  are  enough  to  effectively  use 
the  functional  units  of  the  Cray  Y-MP  vector  processor? 

What  this  question  really  means,  how  I  went  about  answering  this  question,  and  the  actual 
answer  itself  are  described  in  two  chapters  of  this  dissertation. 

In  the  course  of  answering  this  question,  I  developed  two  compiler  algorithms,  a 
vector  instruction  scheduler  and  a  vector  register  assigner.  I  also  examined  the  design  of 
a  vector  processor  to  determine  the  cost  of  implementing  different  configurations  of  vector 
register files. Finally, I carried out simulation experiments to evaluate the effectiveness of
the algorithms I developed and to measure the performance of different register files as well
as to determine the answer to the above question.

In  addition  to  answering  the  above  question,  my  dissertation  also  addresses  the 
more fundamental question: “Why do research in vector architectures?” The short (but
not often heard) answer is “because a vector architecture is an inexpensive processor
design  that  supports  fine-grain  parallelism  well.”  The  longer  version  of  this  answer  is  given 
in  two  chapters:  one  that  contrasts  how  vector  architectures  support  fine-grain  parallelism 
in  comparison  to  other  architectural  classes,  and  another  chapter  that  explains  how  vector 
architectures  are  inexpensive  to  implement. 


1.1  Definitions 

Before I present an overview of my dissertation, I discuss my usage of particular
words. In the discipline of computer science, the term architecture is typically used as an
abbreviation for computer architecture, which refers to the organization and implementation
of a computer. A computer, in turn, consists of three major components: processor,
memory, and input/output. For my thesis, I examine the design of only the processor
component, which is also commonly known as the central processing unit or CPU for short.
Hence, in this dissertation, I use the term architecture to refer to the organization and
implementation of a processor [...] of an entire computer.

I make a distinction between an operation and an instruction. An operation is a task
executed by an instruction. An instruction [...] a particular processor design and represents
a unit of work that is examined during one clock period by the instruction-issue [...].
Furthermore, an instruction can cause one or more operations to execute, [...] architecture,
and specifies not only what operations [...] operands and results of those operations are
located. In describing [...] optimization, I use the more abstract term [...] execution and
hence has fewer restrictions on how it should be executed. [...] set of operations into a
sequence of instructions. [...] algorithm for instruction scheduling determines in what order
a set of [...] register assignment determines where [...] operation is located.

[...] than one operation to execute [...] accurately depicts the amount of parallelism [...].
Furthermore, it is more insightful to compare [...] fine-grain parallelism, rather than
instruction-level parallelism, supported by [...] because different architectures can specify
different amounts of work for an instruction but an operation specifies the same amount of
work across different architectures. [...] parallelism to refer to the parallelism that can be
supported by a uniprocessor. (Although the [...])

[...] a part of hardware [...] port, which executes a memory operation, is a special purpose
[...] as the interface between a processor and its memory system.

1.2  Overview  of  Dissertation 

This dissertation has two common themes: the use of fine-grain parallelism to
improve performance and the cooperation between software and hardware to design a
cost-effective [...] to this question must [...] this question requires [...]. [...] chapters in
this dissertation develops these themes, with the first two emphasizing fine-grain parallelism
and the last two concentrating on how aspects of a vector processor interact with aspects of
a compiler.

In addition to providing background information about vector architectures and
fine-grain parallelism, the first two major chapters address the question “Why do research
in vector architectures?” I begin Chapter 2, Fundamentals of Vector Architectures, with
a short discussion on data dependence, a concept that affects both hardware and software
when using parallelism. Next I describe how the hardware features of a vector architecture,
and in particular the vector instruction, support fine-grain parallelism and contrast these
features with those of other architectures that support fine-grain parallelism. Because not all
parts of a program can be executed with vector instructions, I then describe the properties
of a vectorizable program fragment and explain how a compiler can identify these.

At the ASPLOS¹ conference in 1991, cries of “Vector architectures are history,”
were heard throughout the sessions. The rapid advancement of VLSI technology and the
trend in microprocessor design towards superscalar architectures have led to this prediction
of the vector architecture’s imminent demise. Because of this dire prediction, many
people may mistakenly believe that research in vector architectures is a futile activity. To
counter this prediction as well as these mistaken beliefs, in Chapter 3, A Case for Vector
Architectures, I present arguments with accompanying data that emphasize the hardware
and software strengths of vector architectures and the weaknesses of superscalar ones.

Chapter 4, Common Experimental Framework, is a short one in which I describe
the basic vector hardware, performance tools, and workload used in the empirical studies
carried out in the next two chapters.

In the last two major chapters, I answer the question posed in the opening paragraph.
In Chapter 5, Register Usage and Instruction Scheduling, I analyze the performance
of using more vector registers. As part of this analysis, I also show that algorithms for
instruction scheduling have a major impact on how effectively more registers are used to
improve performance. I present empirical data that determine the minimum number of
vector registers needed to significantly improve performance over the current design of the
Cray Y-MP vector processor. In Chapter 6, Bus Usage and Register Assignment, I examine
the cost of using more vector registers and investigate a special organization, which I call a
partitioned vector register file, that is less costly but more restrictive in its access to
individual vector registers. For this investigation, I describe a register assignment algorithm I
developed that uses such a restrictive organization with minimal loss in performance. I also
present empirical data for choosing a register organization that is most cost-effective for
improving performance.

In the closing chapter, Concluding Remarks, I summarize my work by highlighting
the contributions of this dissertation and finish by discussing extensions of this work for
future study.


¹ASPLOS is an acronym for Architectural Support for Programming Languages and Operating Systems.
Chapter 2

Fundamentals  of 
Vector  Architectures 


In this chapter, I describe the fundamentals of vector hardware and compilation
to demonstrate how suitable a vector architecture is for supporting fine-grain parallelism.
Because my dissertation examines the interaction between vector hardware and compiler
algorithms, the information in this chapter also serves as background material for the
remaining chapters. This discussion outlines problems faced by any architecture that
supports fine-grain parallelism and differentiates what is specific to a vector architecture. I
begin with a short discussion on data dependence, a concept that affects any architecture
that uses parallelism. I then describe the hardware capabilities needed to support
fine-grain parallelism and contrast how a vector architecture and three other architectures
provide this support. Because not all parts of a program can be executed with vector
instructions, I next describe the properties of a vectorizable program fragment, using the
hardware as the basis for justifying each property, and outline how a program is transformed
into vectorized code.

2.1  Data  Dependence 

Correct parallel execution requires that multiple operations be executed simultaneously
without changing a program’s functionality, which is typically defined by the output
produced by the scalar version of the program. Imprudently executing any operations in
parallel will likely alter a program’s functionality. One way to maintain correct functionality
is to guarantee that accesses to the same storage location occur in the same order as they
do in the scalar version. In other words, references to the same location must be serialized
whereas references to different locations can execute in parallel. Two references that access
the same storage location form a data dependence.¹ Moreover, to ensure that the correct value
is always in the common location, this dependence relation specifies the order in which the
two references can execute: the reference that accesses the common storage location first
must execute first [120]. Hence, any architecture that supports fine-grain parallelism must
maintain the orderings specified by all data dependences.

¹Another type of dependence that occurs in a program is control dependence. I focus only on data
dependence in this discussion. Ferrante, Ottenstein, and Warren show how a compiler can uniformly treat
control and data dependences [39].


                      DEPENDENCE TYPES

HARDWARE NAME    COMPILER NAME        ACCESS TYPES AND ORDER

RAW              flow dependence      read after write
WAR              anti-dependence      write after read
WAW              output dependence    write after write

Figure 2.1: Hardware and Compiler Names for Dependence Types

This table shows the names given by the hardware and compiler communities to the
types of data dependences. The hardware names refer to dependences that occur in registers,
whereas the compiler names refer to those that occur in memory. A fourth type, read after
read (an input “dependence”), is not really a dependence because no state is changed. RAW or
flow dependences are considered the only true data dependences in that they cannot be
eliminated without jeopardizing correct functionality. WAR/anti and WAW/output dependences
occur because storage is finite; if every newly-created value were assigned its own storage
location, these dependences would not occur.

A dependence can be classified by access type and the order in which two references
occur. In addition, data dependences can occur in two different storage locations:
registers and main memory. Although techniques in either hardware or software can be
used to detect, work around, or even eliminate data dependences, hardware normally handles
dependences occurring in registers and a compiler handles those occurring in memory.
Consequently, the hardware and compiler communities have given different names, which
are listed in Figure 2.1, to the same dependence type.
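
The classification by access type and order can be sketched in a few lines of Python. This is only an illustration: the function name classify, the tuple encoding of a reference, and the register names in the usage note are assumptions made for the example, not notation from this dissertation.

```python
from typing import Optional, Tuple

# Hardware name -> compiler name, as listed in Figure 2.1.
COMPILER_NAME = {
    "RAW": "flow dependence",
    "WAR": "anti-dependence",
    "WAW": "output dependence",
}

Reference = Tuple[str, str]  # (access, location); access is "read" or "write"


def classify(first: Reference, second: Reference) -> Optional[str]:
    """Return the hardware name of the dependence from `first` to `second`.

    `first` is the reference that occurs earlier in the scalar version of
    the program. None means the two references may execute in parallel.
    """
    access1, location1 = first
    access2, location2 = second
    if location1 != location2:
        return None  # different storage locations: no dependence
    if access1 == "write" and access2 == "read":
        return "RAW"
    if access1 == "read" and access2 == "write":
        return "WAR"
    if access1 == "write" and access2 == "write":
        return "WAW"
    return None  # read after read changes no state
```

For example, classify(("write", "R1"), ("read", "R1")) yields "RAW", which COMPILER_NAME maps to "flow dependence".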

Register dependences can either be avoided by the software or handled by the
hardware. With information about hardware, a compiler can assign values to registers in
order to avoid WAR and WAW dependences. These dependences can [...]
hardware is upgraded to provide more parallelism than what was compiled for. Fortunately,
with appropriate hardware, register-dependent operations can execute correctly in parallel.
This is because register addressing is explicit, allowing hardware to accurately recognize
when register dependences occur. An example of a hardware mechanism that eliminates
WAR and WAW dependences is register renaming (to be discussed in Section 2.2.1), whereas
data forwarding uses bypass paths to ensure that a RAW-dependent instruction reads the
correct value.

Memory dependences can also be detected and handled by either the hardware
or the compiler. Unfortunately, there is no comparable hardware technique to register
renaming or data forwarding that will allow dependent memory references to execute
simultaneously. At best, hardware that allows out-of-order execution, which is also known as
dynamic scheduling, can be used to minimize idle cycles due to dependent memory references.
Alternatively, a compiler can detect memory dependences using analysis techniques
that provide information not only about memory addresses but also about the access
pattern of related memory references. Once dependences are detected, a compiler can order the
operations to execute in parallel in such a way that data dependences are still preserved.
How hardware or compilers handle data dependences is one central theme of this
dissertation. Section 2.2 provides a fuller description of handling register dependences
in the context of vector hardware and Section 2.3 details how memory dependences are
detected by the compiler. Compilation techniques for scheduling around memory dependences
and avoiding register dependences when using a vector architecture are examined in
Chapters 5 and 6, respectively.


2.2  Hardware  Support  for  Fine-Grain  Parallelism 

A major challenge facing computer designers today is determining how to execute
multiple operations simultaneously to improve performance. For single-instruction-stream,
load/store architectures, research groups are currently investigating four approaches:
superpipelined, superscalar, VLIW,² and vector architectures. To support fine-grain parallelism
irrespective of the architectural approach, hardware must be able to perform simultaneously
more than one instance of the basic sequence of tasks for executing an operation. In other
words, hardware must be able to simultaneously:

1.  initiate  multiple  operations, 

2.  fetch  multiple  operands, 

3.  execute  multiple  operations,  and 

4.  store  multiple  results. 

In the rest of this section, I will describe and contrast how each architectural class
accomplishes these tasks. Because this dissertation focuses on vector architectures as embodied
by the Cray Y-MP, I will give more details on this class of architecture than on the other three.
Although all four tasks are equally important in determining the maximum amount of
parallelism that can be realized, initiating multiple operations has received the most attention
from computer designers, who have named the architectural classes on the basis of how each
class accomplishes this task.

2.2.1  Multiple  Operation  Initiation 

Initiating operations and issuing instructions are closely related activities. Initiating
an operation is the first task in the basic sequence for executing an operation, whereas
the method for issuing instructions determines how this task is done for more than one
operation at a time. The limitation of issuing one instruction per clock period has become
known as the Flynn bottleneck (based on [42]). Both superscalar and VLIW architectures
overcome this bottleneck in order to initiate more than one operation simultaneously. This
is not a requirement, however, because both superpipelined and vector architectures support
fine-grain parallelism and still issue only one instruction each clock period.

²VLIW is an acronym for “Very Long Instruction Word,” suggesting many operations per instruction.

In this subsection, I describe how each architectural class initiates multiple operations
simultaneously. This is accomplished by extending the instruction-issue mechanism
of a scalar, pipelined architecture in a manner that is reflected by the name of a class.
Because there is a large body of programs already compiled for scalar architectures, a
desirable property of such an extension is that the resultant architecture can execute these
scalar-compiled binaries with the possibility of improving performance, a characteristic of
both superscalar and superpipelined architectures but not VLIW and vector ones.

In addition to describing how multiple operations are initiated at once, I provide
examples of implementations in each class. I also discuss hardware mechanisms for handling
register dependences because such mechanisms affect how frequently instructions are issued,
which in turn affects how much parallelism can occur.


Superpipelined  Architectures 

This architectural class supports fine-grain parallelism by using deeper pipelines
and a higher clock rate than those used in a basic scalar machine [70]. For example, whereas
a pipelined machine will have a 4- or 5-stage pipeline, a superpipelined machine will have
an 8- or 10-stage pipeline and half the cycle time. The higher clock frequency is obtained
through the deeper pipelines, hence the name superpipelining. Although one instruction
still specifies only one operation, if the cycle time of a pipelined machine is considered the
basic time unit, then a superpipelined machine gives the appearance of multiple-operation
initiation by issuing instructions at a faster rate. However, the inability to control clock skew
will ultimately limit how fast the clock rate can be made and hence how much performance
can be gained by this approach [57].

Because, in the strictest sense, operations are not actually issued at the same
time, the organization of a superpipelined machine is not significantly different from that
of a pipelined machine; the main difference lies at the level of hardware implementation.
As a result, scalar-compiled binaries can be executed with the possibility of improving
performance.

Superpipelined processors have been implemented at the high end of the cost
spectrum beginning with the CDC 6600 in 1964 and continuing to this day with the scalar
units of the Cray processors. More recently, in the 1990s, the microprocessor world, which
is at the lower end of the cost spectrum, produced the MIPS R4000, an 8-stage pipelined
implementation of the MIPS architecture [71], and the DEC EV4, a 64-bit implementation
of the new Alpha architecture [75].


Superscalar  Architectures 

For this architectural class, fine-grain parallelism is achieved by issuing multiple
scalar instructions in the same clock period, hence the name superscalar. Typically, the
operational types of the simultaneous operations are different. Because one instruction still
specifies one operation and the instruction-issue unit dynamically determines how many
instructions can be issued, scalar-compiled binaries can be executed with the possibility of
improving performance.

Among the various architectural approaches, the superscalar one is the newest and
is currently the focus of the commercial microprocessor community. In 1989, two superscalar
implementations, the IBM RS/6000 and the Intel i860, were announced. The 1990
and 1991 Hot Chips Symposium hosted several presentations that described superscalar
designs, such as the Metaflow Lightning and the Sun SuperSPARC [94, ...].

To increase the number of instructions that can execute simultaneously, additional
hardware allows multiple register-dependent instructions to issue in the same clock period.
Bypass hardware to forward data as it becomes available allows RAW-dependent instructions
to be issued together [13]. Because a limited number of registers causes WAR and
WAW dependences to occur, register-renaming hardware eliminates these dependences by
providing more physical registers than the instruction set can specify. When a logical
register is used by two instructions in either a WAR or WAW dependence, the register is
mapped to two different physical registers, thus removing the dependence and allowing the
instructions to execute at once. Register renaming was first implemented in [...]
[112] and is included in contemporary superscalar processors such as the IBM RS/6000 and
the Metaflow Lightning [89, 94].
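
The mapping of logical to physical registers described above can be sketched as follows. This is an illustrative model, not the mechanism of any processor named here: the function name rename, the register names, and the assumption of an unlimited supply of physical registers (real hardware recycles a finite pool) are all simplifications made for the example.

```python
from itertools import count


def rename(instructions, logical_registers):
    """Rewrite instructions so that every write uses a fresh physical register.

    instructions: list of (dest, sources) using logical register names,
    in program order. Returns the same list using physical register names.
    """
    fresh = count()
    # Every logical register starts out mapped to its own physical register.
    mapping = {reg: f"P{next(fresh)}" for reg in logical_registers}
    renamed = []
    for dest, sources in instructions:
        # Sources use the most recent mapping, so RAW dependences are preserved.
        physical_sources = tuple(mapping[s] for s in sources)
        # A new physical destination removes WAR and WAW dependences on dest.
        mapping[dest] = f"P{next(fresh)}"
        renamed.append((mapping[dest], physical_sources))
    return renamed


program = [
    ("R1", ("R2", "R3")),  # reads R2
    ("R2", ("R4", "R5")),  # WAR with the first instruction on R2
    ("R1", ("R2", "R6")),  # WAW with the first on R1, RAW with the second on R2
]
renamed_program = rename(program, ["R1", "R2", "R3", "R4", "R5", "R6"])
```

After renaming, the first and second instructions no longer share a register, and the two writes to R1 target different physical registers, while the third instruction still reads the physical register written by the second.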


VLIW  Architectures 

Like a superscalar architecture, a VLIW architecture also issues more than one
operation per clock period, where the type of each operation is usually different. However,
whereas each operation in a superscalar architecture requires a separate instruction, in a
VLIW architecture many operations are encoded in a single instruction, resulting in the Very
Long Instruction Word from which this architectural class takes its name. Moreover, a
compiler is responsible for grouping operations into a VLIW instruction, so that,
unlike execution on a superpipelined or superscalar architecture, scalar-compiled binaries
cannot take advantage of the parallelism when executing on VLIW hardware.
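
To make the compiler's grouping task concrete, the sketch below packs operations into long instruction words with a simple greedy rule. It is a hypothetical illustration only, not the method of any VLIW compiler mentioned in this chapter: the function name pack_vliw, the operation encoding, and the unit names are assumptions, and every register dependence is conservatively pushed to a later word.

```python
def pack_vliw(operations):
    """Greedily pack operations into long instruction words.

    operations: list of (name, unit, dest, sources) in scalar program order.
    Each word holds at most one operation per functional unit.
    Returns a list of words, each a dict mapping unit -> operation name.
    """
    words = []   # words[i] maps a functional unit to the operation in that slot
    placed = []  # (word index, dest, sources) of operations already packed
    for name, unit, dest, sources in operations:
        earliest = 0
        for index, prev_dest, prev_sources in placed:
            raw = prev_dest in sources
            war = dest in prev_sources
            waw = dest == prev_dest
            if raw or war or waw:
                # Conservatively place a dependent operation in a later word.
                earliest = max(earliest, index + 1)
        index = earliest
        while index < len(words) and unit in words[index]:
            index += 1  # the slot for this unit is already taken
        while len(words) <= index:
            words.append({})
        words[index][unit] = name
        placed.append((index, dest, sources))
    return words


example = [
    ("load_a", "mem", "R1", ("R9",)),
    ("add", "alu", "R2", ("R1", "R3")),  # RAW on R1: must follow load_a
    ("mul", "fpu", "R4", ("R5", "R6")),  # independent: shares a word with load_a
    ("load_b", "mem", "R7", ("R9",)),    # independent, but the mem slot is busy
]
packed = pack_vliw(example)
```

Four scalar operations fit into two long instruction words here. Because this grouping is fixed at compile time, a binary compiled one operation per instruction gains nothing from the extra functional units.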

VLIW architectures have typically been implemented as minisupercomputers. An early
VLIW computer is the Floating Point Systems AP-120B, which was first delivered in 1976
[59]. In 1983, J. A. Fisher actually coined the term VLIW to describe the ELI-512, a
computer that was built at Yale University [10]. The Multiflow Trace computers, the
commercial version of the ELI-512, became available in 1987 [23]. In 1989, Cydrome announced its
VLIW computer, the Cydra 5 [97]. The Intel iWarp, a commercial realization of a research
project at Carnegie-Mellon University, was developed in the same time period [6, 14].

In addition to extracting parallelism, a VLIW compiler is also responsible for handling
register dependences. The hardware contains no synchronization mechanisms and, in
particular, does not check for register dependences [23, 97]. It is the compiler’s responsibility
to ensure that operations use the correct register values. Although this precludes binary
compatibility, it simplifies the control logic in the hardware.
Vector Architectures

Like a VLIW architecture, a vector architecture issues more than one operation
per clock period and relies on a compiler to extract any parallelism that can use its hardware.
However, whereas a VLIW instruction causes multiple operations of different types
to initiate simultaneously in different functional units, a vector instruction causes multiple
operations of the same type to initiate sequentially in one functional unit. The name vector
comes from the fact that a vector instruction operates uniformly on a set of related data,
such as a column or row in a matrix, to produce another vector of data.

There have been many commercially successful implementations of vector architectures,
beginning with the load/store architecture of the Cray-1 in 1976 [110, 101, 8].
Since then Cray Research Incorporated has produced a succession of vector machines: the
Cray X-MP in 1982, the Cray-2 in 1985, the Cray Y-MP in 1988, and the Cray Y-MP C-90
in 1991. In 1983, three Japanese vendors entered the supercomputer market with the
Fujitsu VP200, the Hitachi S810, and the NEC SX/2 [85, 88, 116]. The 1980s also saw
less costly vector implementations. For example, in 1985, Convex produced a superminicomputer,
the C1, and now offers a multiprocessor version, the C2 [68, 19]. In 1986, IBM
introduced the System/370 vector architecture, a family of vector computers designed to
cover a range of cost/performance implementations, the first being the 3090 Vector Facility
[16]. In 1988, the Ardent Titan entered the market as a superworkstation with graphics
capabilities [31]. In 1989, Digital Equipment Corporation formally introduced the VAX 6000
Model 400 series, which extended the VAX architecture to include vector processing [102].
The most recent development occurred in 1991 when Thinking Machines Corporation
included a vector execution unit as part of the processor node in its Connection Machine-5,
a massively parallel processor [109], and both NEC and Fujitsu have fabricated a vector
processor on a single VLSI chip [90, 64].

Because only one instruction is issued during each clock period, how a vector
architecture initiates more than one operation per clock period is less obvious than how
this is accomplished by the other classes of architectures. Moreover, a vector instruction
can take a long time to execute because, in a load/store architecture, such as the Cray Y-MP,
the maximum number of operations executed by a vector instruction is equal to the number
of elements in a vector register, or 64 in the case of the Cray Y-MP. Hence, a Y-MP vector
instruction, which sequentially initiates its operations, can take over 64 clock periods to
execute. Two vector instructions can execute concurrently, however, when the instructions
use completely different resources, such as separate functional units and distinct registers
for operands and results. In this way, two operations, one from each instruction, will
initiate simultaneously even though the instructions still issue one at a time (see Figure 2.2).
Multiple operations will also commence simultaneously when any scalar instructions are
issued during the execution of a vector instruction.
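
The overlap described above can be illustrated with a small counting model. It is a sketch under strong assumptions that are not claims about actual Cray Y-MP timing: all instructions are independent, each uses its own functional unit, there are no startup or memory delays, and the name initiations_per_clock is hypothetical.

```python
def initiations_per_clock(instructions):
    """Count the operations initiated in each clock period.

    instructions: operation counts in issue order; a scalar instruction
    has count 1 and a vector instruction has a count equal to its vector
    length. One instruction issues per clock period, and an instruction
    issued at clock t initiates one operation in each of the clocks
    t, t+1, ..., t+count-1.
    """
    if not instructions:
        return []
    last_clock = max(t + n for t, n in enumerate(instructions))
    counts = [0] * last_clock
    for issue_clock, n in enumerate(instructions):
        for clock in range(issue_clock, issue_clock + n):
            counts[clock] += 1
    return counts


# Two independent 64-element vector instructions followed by one scalar instruction.
trace = initiations_per_clock([64, 64, 1])
```

In this model only one instruction issues per clock period, yet two operations are initiated in most clock periods, and three in the period in which the scalar instruction issues.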

Vector hardware, unlike VLIW hardware, provides interlock logic for its registers.
This means that without additional hardware a register-dependent vector instruction must
wait until the independent vector instruction has finished using the common register, as
Figure 2.3(a) shows. Two hardware mechanisms, however, can be used to eliminate this large
loss in potential parallelism. If two vector instructions form a RAW dependence, chaining
hardware allows the dependent instruction to begin executing as soon as the first operation
of the independent instruction has finished executing. Figure 2.3(b) gives an example
of chaining. On the other hand, with tailgating hardware, a WAR-dependent instruction
is issued immediately following the independent instruction as shown in Figure 2.3(c). A
violation of the WAR dependence is avoided because register reads occur near the beginning
of a pipeline and register writes occur at the end, thus guaranteeing that the independent
operation reads the register before the dependent one writes into it. In summary, using
chaining or tailgating allows multiple operations to initiate simultaneously in the presence
of dependent vector instructions. Neither of these approaches, however, affects the
instruction set design; their main impact is to improve performance by increasing the
opportunities for fine-grain parallelism.

Figure 2.2: Multiple Operation Initiation with Independent Vector Instructions

This execution trace illustrates how independent vector and scalar instructions [...]
initiation of more than one operation. A scalar instruction causes only one operation to execute
while a vector instruction causes multiple operations (64 in this example) of the same type to
initiate in one functional unit. The number of operations executed by a vector instruction is
typically stored in a vector length register, called VL in the Cray Y-MP.

A vector instruction is identified by its use of vector registers, such as V0 or V2, [...]
consists of multiple registers. (A more thorough description of vector registers is provided in the
next subsection.) The vector instruction V0<-M[R1] transfers into vector register V0 data from
memory locations starting at the address stored in R1. The vector instruction
V3<-V1+V2 stores into the i-th register of vector register V3 the sum of the i-th registers of
V1 and V2.

[...] one instruction is actually issued every clock period. But because a vector
instruction causes multiple operations to initiate over time, the overlapped execution of vector
instructions allows fine-grain parallelism to occur.

Chaining and tailgating hardware were first implemented in different Cray computers.
Because the Cray-1 uses single-ported register cells, it provides a limited form
of chaining: a RAW-dependent instruction can begin executing within a certain period of
time, called the chain slot time, after the independent instruction has issued. If, for some
other reason, the dependent instruction cannot issue within the chain slot time, it must
wait until after the independent instruction has finished executing completely. By using
dual-ported register cells, the Cray X-MP and Y-MP implement fully flexible chaining: a
RAW-dependent instruction can begin executing any time after the first operation of the
independent instruction has finished executing. Rather than using chaining, the Cray-2
implements tailgating. To date, there is no vector machine that implements both chaining
and tailgating.
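
The benefit of chaining can be made concrete with a rough timing model. The Python sketch below is only an illustration under stated assumptions: one operation initiates per clock period, the load latency of four clock periods is the one assumed in Figure 2.3, and the latency of the dependent add (ADD_LATENCY) is a hypothetical value. It does not model the chain slot time of the Cray-1 or any issue delays.

```python
# Simplified timing model; the latencies are illustrative assumptions,
# not measured Cray values.

def result_time(start, latency, element):
    """Clock at which the result of operation `element` (0-based) is available,
    assuming one operation initiates per clock beginning at `start`."""
    return start + element + latency

def dependent_start(start, latency, vl, chaining):
    """Earliest clock at which a RAW-dependent vector instruction may begin."""
    if chaining:
        # Chaining: begin as soon as the first result is available.
        return result_time(start, latency, 0)
    # No chaining: wait until the last result has been written.
    return result_time(start, latency, vl - 1)

def pair_finish(vl, first_latency, second_latency, chaining):
    """Clock at which the last result of the dependent instruction is available."""
    begin = dependent_start(0, first_latency, vl, chaining)
    return result_time(begin, second_latency, vl - 1)

VL = 64            # elements per vector register on the Cray Y-MP
LOAD_LATENCY = 4   # load latency assumed in Figure 2.3
ADD_LATENCY = 3    # hypothetical latency for the dependent add

print(pair_finish(VL, LOAD_LATENCY, ADD_LATENCY, chaining=False))
print(pair_finish(VL, LOAD_LATENCY, ADD_LATENCY, chaining=True))
```

In this model the unchained pair takes nearly twice as long as the chained pair, because the dependent instruction waits for all 64 results rather than for the first one.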

2.2.2  Multiple  Operands  and  Results 

The  second  and  fourth  tasks  in  the  basic  sequence  for  executing  an  operation  fetch 
and  store  data  for  an  operation.  Simultaneously  handling  multiple  operands  and  results  in 
a load/store architecture requires a register file with multiple read and multiple write ports.
A  register  file  with  R  read-ports  and  W  write-ports  provides  the  capability  of  reading  R 
registers  and  writing  W  registers  during  the  same  clock  period.  Figure  2.4  shows  three 
distinct  configurations  for  implementing  a  multiported  register  file:  monolithic,  partitioned, 
and  distributed.  Combinations  of  these  types  of  register  files  are  also  possible.  In  this 
subsection, in addition to explaining how these types of register files provide simultaneous,
multiple access, I also:

•  discuss  how  these  configurations  differ  by  examining  how  well  individual  registers  are 
connected  to  all  of  the  available  functional  units;  and 

•  give  examples  for  each  type  of  register  file.  Although  these  configurations  can  be 
used  with  any  of  the  architectural  approaches,  there  is  a  natural  tendency  for  an 
architectural  class  to  use  a  specific  configuration. 

The  most  straightforward  configuration  for  implementing  a  multiported  register 
file  is  to  use  a  register  cell  with  multiple  read-  and  multiple  write-ports.  Although  the 
number  of  registers  actually  accessed  is  determined  by  the  number  of  ports,  all  registers  in 
such  a  monolithic  register  file  are  available  simultaneously  as  an  operand  or  a  destination  for 
any  functional  unit.  This  type  of  register  file  is  also  known  as  a  shared  register  file  [105, 15]. 
Figure 2.3: Multiple Operation Initiation with Dependent Vector Instructions

These execution traces illustrate how chaining and tailgating hardware increase parallelism in
the presence of RAW and WAR dependences between vector registers. The chart assumes
that the latency for one load operation is four clock periods. The notation for vector instructions is
explained in Figure 2.2.

Figure 2.4: Types of Multiported Register Files

This figure shows three different configurations for providing a multiported register file to
support multiple functional units. The components that are part of a register file have a [...].

When implementing a register file with a given bandwidth, these configurations represent a
tradeoff between decreasing area cost and increasing restrictions on accessibility per clock period.
The monolithic configuration uses a multiported register cell that increases in size with increasing
bandwidth, which means that any register can be accessed, even multiple times, in the same clock
period. The partitioned configuration uses a dual-ported register cell with one read port and one
write port. Although the register banks are fully connected to the functional units by a pair of
multiplexors, at most two registers from each register bank can be accessed in the same
clock period. The distributed configuration does not use multiplexors and lacks the full connectivity
of the other two configurations. Although any register is available for its associated functional unit,
an explicit transfer is needed if a different functional unit requires access.
A monolithic register file is used by most superpipelined, superscalar and VLIW designs.
For example, the floating-point register file of the IBM RS/6000 has 4 read-ports and 2
write-ports to support a three-input multiply-add unit, a load port and a store port. The
SUN SuperSPARC also uses registers with 4 read-ports and 2 write-ports [15]. A register
file on the Multiflow Trace handles four reads and four writes at once to support a
floating-point multiplier, a floating-point adder and two memory ports [23]. Whereas these
register files are similar in organization, the Intel iWarp, in contrast, has a 17-ported
register cell for 128 32-bit words [67].

An alternative configuration for providing a multiported register file is to partition
the registers into banks where each register has only one read-port and one write-port,
a configuration that I call a partitioned register file. This configuration is used in vector
architectures, where a register bank is more commonly known as a vector register and is
comparable in organization to a simple, scalar register file. Each vector register consists
of many dual-ported registers and has its own read bus and write bus. Multiple vector
registers, which are considered collectively as a vector register file, give the appearance of
a register file with multiple read and write ports. A register file that is partitioned into N
banks can allow N accesses to occur in the same clock period. With chaining or tailgating
hardware, N reads and N writes can occur simultaneously.

When compared to a monolithic register file, a partitioned one provides less
connectivity between any individual register and any functional unit. Despite being partitioned,
such a register file has two sets of multiplexors that provide complete flexibility in
connecting any register bank with any functional unit. One set of multiplexors connects the
register-read buses to the input buses of the functional units, while another set connects the
output buses of the functional units to the register-write buses. However, not all registers
are available simultaneously as an operand or as a result. Instead, only two registers per
vector register are available, one for a read and one for a write, during each clock period. In
other words, a vector register can be used concurrently by at most two vector instructions:
as the destination for one and the source for the other.
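
The port restriction of a partitioned register file can be expressed as a small check. The following Python sketch is a hypothetical illustration, not a description of any Cray interlock logic: the function name port_conflicts and the tuple encoding of instructions are assumptions, and an instruction that names the same vector register twice as a source is counted as a single reader.

```python
from collections import Counter

def port_conflicts(instructions):
    """Given concurrently executing vector instructions, each written as
    (destination, [sources]), return the vector registers that would need
    more than one read port or more than one write port."""
    readers = Counter()
    writers = Counter()
    for destination, sources in instructions:
        writers[destination] += 1
        for register in set(sources):   # assumption: one read per instruction
            readers[register] += 1
    registers = set(readers) | set(writers)
    return {r for r in registers if readers[r] > 1 or writers[r] > 1}

# V1 <- M[R1] chained with V3 <- V1+V2: V1 is written by one instruction and
# read by the other, which a dual-ported vector register allows.
print(port_conflicts([("V1", []), ("V3", ["V1", "V2"])]))

# V3 <- V1+V2 and V4 <- V1+V5 both read V1, which exceeds its single read port.
print(port_conflicts([("V3", ["V1", "V2"]), ("V4", ["V1", "V5"])]))
```

The first call reports no conflict and the second reports V1, mirroring the rule that a vector register can serve as the destination for one instruction and the source for one other.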

As Figure 2.5 shows, the number of registers provided in vector register files varies
greatly, mainly because the number of registers per vector register spans a wide range.
Vector processors typically have eight vector registers; eight of the eleven implementations
listed use a vector register file that can have eight vector registers. In contrast, the number
of registers in one vector register ranges from 4 registers in a Thinking Machines processor
node to 2048 registers in an Ardent Titan. The Thinking Machines, Ardent, and Fujitsu
processors are different in that the vector register file is reconfigurable: software can
partition the register file into any number of vector registers where the number of registers
per vector register can be varied as long as the total number of registers remains constant.
Reconfigurability is possible because a vector instruction can address individual elements
of a vector register.

A VLIW machine, the Cydrome Cydra, also uses a variant of a partitioned register
file called a multiconnect in Cydrome terminology [28], which demonstrates that a type of
register file is not necessarily confined to a specific architectural class. A multiconnect is
slightly different from a vector register file in that the connectivity between registers and
functional units in the former is more restrictive for writes and less restrictive for reads. In
Figure 2.5: Configurations of Vector Register Files for Commercial Processors

This table shows configurations of register files for eleven commercial [...]
in order of total register capacity and minimum number of vector registers. All configurations store
[...].

Several of the implementations have special features. The vector-register length in the 3090
can vary among implementations, providing a range of cost/performance models. The
implementations [...] NEC include a set of vector registers, whose size is enclosed in [...],
[...] connected directly to the arithmetic functional units and are used mainly for [...].
The implementations from Thinking Machines, Ardent, and Fujitsu provide a [...]
reconfigurable in software. The number of vector registers and their [...]
indicated range as long as the total number of registers remains constant: 64 for the CM-5, 512
for the VLSI implementation by Fujitsu, and 8192 for the Ardent Titan and Fujitsu VP200.

With the exception of the CM-5 processor node, which is part of a massively parallel processor,
and the low-end IBM implementations, the total number of registers provided in a vector register
file ranges from 256 to 8192, much larger than the typical 32 registers used in today's scalar and
superscalar architectures.
a multiconnect, a register bank can only receive results produced by a particular functional
unit; hence, the number of register banks, or partitions, is equal to the number of functional
units connected to such a register file. In contrast, not only can any register bank deliver
an operand to any functional unit, but each register bank can deliver, in the same clock
period, operands to all functional units. To provide this functionality, each register bank
must have as many read ports as there are inputs to functional units; to implement this
in the Cydrome Cydra, a bank uses dual-ported registers that are replicated rather than
registers with multiple read-ports. In the Cydra, there are 64 registers per register bank
and six functional units, for a total of 384 registers that can store unique data. Because
a register bank is replicated to provide increased accessibility to its registers, the physical
implementation of this multiconnect requires six times as many registers.

Finally, a configuration different from the monolithic and partitioned ones is a
distributed register file (although the definition of a register file becomes somewhat vague
at this point). In this configuration, the registers are divided into sets where each set is
connected to its own functional unit(s). A slight variation of a distributed register file is a
split register file where each register set has a specific purpose such as an integer register
set or a floating-point one [105, 15]. A VLIW architecture, the Multiflow Trace, has
a distributed register file where each register set is actually a monolithic one, which was
described earlier. A distributed register file is also used in the vector processor node for
Thinking Machines' CM-5. The number of register sets can vary from four in the CM-5
processor to 7, 14, or 28 in a Multiflow Trace, depending upon the model.

Unlike the partitioned register file, a distributed one is more restrictive in its
connectivity between itself and any functional units. If a value that is stored in one register
set is needed for a different group of functional units, the value must first be transferred
to the appropriate register set. This transfer is often made through the [...],
making this configuration similar to the separate register files in a multiprocessor. Without
a special algorithm for assigning registers, up to 50% of performance could be lost due to
excessive data transferring [41].

2.2.3  Multiple  Operation  Execution 

The third task in the basic sequence for carrying out an operation is to actually
execute it. Executing multiple operations at the same time can be accomplished with a
pipelined functional unit or multiple functional units or both. When multiple functional
units are used, they can be general purpose or special purpose (e.g., a floating-point adder)
or somewhere in between (e.g., a floating-point unit or an integer unit). The main
advantages of a general purpose functional unit are fewer unique components to design and
simpler algorithms for scheduling instructions in a compiler. The main advantage of
special purpose functional units is a faster implementation of hardware for specific types of
operations.

Vector architectures typically implement multiple, fully-pipelined, special purpose
functional units. Superpipelined designs also use multiple, special purpose functional units,
although these are not always pipelined; the functional units of the CDC 6600 are not
pipelined whereas those of the CDC 7600 and the Cray machines are. Superscalar designs,
such as the IBM RS/6000, Metaflow Lightning, and Sun SuperSPARC, use a fully-pipelined
floating-point unit and integer unit as well as a special purpose unit for handling branch
operations. VLIW designs have used various combinations: the Cydrome Cydra uses
[...] pipelined functional units; the Multiflow Trace uses several [...]
floating-point and integer units; and the Intel iWarp uses non-pipelined, special purpose
functional units.
In addition to providing multiple functional units, hardware also provides a
mapping from the functionality of a given operation to any functional unit [...]
that functionality. Changing the configuration [...] mapping;
this is part of a hardware implementation and is independent from instruction set
design. Thus, of the four tasks, this is the most decoupled from the design of an instruction
set, and there is no inherent reason why special purpose or general purpose functional units
should be used in one architecture and not another.


2.3  Vectorization 

In this section, I describe how a vectorizing compiler transforms a source program
that is written for a scalar processor into code containing vector instructions. An
understanding of this software procedure, known as vectorization, is important for hardware
designers because vectorization facilitates the effective use of a vector [...].
Consequently, this description is not intended to be a thorough examination of the research issues
concerning vectorization but rather a tutorial that outlines the main aspects of vectorization
for those who are more familiar with hardware design.

Because vector hardware imposes some restrictions on what can be [...]
vector instructions, not all parts of a program can be translated into vectorized code. With
vector hardware as the motivating factor, I first present the properties of a vectorizable
program fragment. I next outline how a vectorizing compiler identifies these properties
and generates vectorized code. This last part also describes how using vector instructions
is conceptually similar to using "loop unrolling", an optimization technique used in scalar
compilation.

2.3.1  Properties of A Vectorizable Program Fragment

Not  all  parts  of  a  program  can  be  translated  into  vector  instructions.  In  this 
subsection,  I  show  that  a  program  fragment  can  be  translated  into  vectorized  code  if  it  has 
the  following  properties: 

1.  it  is  a  loop; 

2.  it has at least one variable, called an aggregate variable, that can reference different
memory  locations  (e.g.,  an  array); 

3.  the  memory  accesses  of  each  aggregate  variable  form  an  arithmetic  progression;  and 

4.  any  statement  that  contains  an  aggregate  variable  cannot  depend  on  itself. 

Because it is a compiler's job to identify these properties, many presentations have been
made using a simplified model of vector execution as seen by a compiler. I take a different
approach by explaining how a hardware implementation of a vector instruction contributes
to these [...] of conditional statements and array references with nested indices, do not
prevent vectorization as long as [...] hardware [...] needed. I will remove this simplifying
assumption in Section 2.[...] when [...] generation of vectorized code.

Property  1:  Loop 

[...] vector instruction. For example, in the following loop

      DO 10 I=1,N
   10 A(I)=A(I)+B(I)

[...] it is possible to express vectorizable computations in a fully "unrolled" fashion. For example,
the following two program fragments express the same computation.

      DO 10 I = 1,5              C(1) = A(1) + B(1)
   10 C(I) = A(I) + B(I)         C(2) = A(2) + B(2)
                                 C(3) = A(3) + B(3)
                                 C(4) = A(4) + B(4)
                                 C(5) = A(5) + B(5)

[...] could be interspersed among other, less related statements.

Property  2:  Aggregate  Variables 

The second property of a vectorizable program fragment is that it must contain at
least one variable that can reference a different location on each iteration. This is because each
operation in a vector instruction accesses a different storage location. I call such
a variable an aggregate variable because it may reference a number of different storage
locations. For instance, in the following loop:

      DO 20 I=1,N
      A=A+21
   20 B(I)=B(I)+21

the array reference B(I) is an aggregate variable whereas the scalar variable A is not. Other
examples are pointers and procedure parameters. Just as a vector instruction is used to
execute a statement in a loop, a vector register stores the values of an aggregate variable.
For example, the i-th element of the array B is stored in the i-th register of a vector
register.

Property  3:  Arithmetic  Progression  for  Memory-access  Pattern 

The third property of a vectorizable statement is that the memory accesses of
each aggregate variable form an arithmetic progression [101]. This is because the vector
address generator is only capable of addition. As a counterexample, the access pattern for
B(J) in the following loop

      DO 10 I=1,N
      J=J*2
   10 A(I)=A(I)+B(J)

is a geometric progression requiring multiplication for successive addresses. This property is
not all that restrictive because array references in most programs proceed in an arithmetic
sequence.

This property does not completely describe all aggregate variables that can be
vectorized. For example, the access pattern of an array reference with a nested index, such
as A(B(I)), can be completely random. However, because calculating the addresses uses
only addition and the B addresses still form an arithmetic progression, this reference can be
vectorized if gather/scatter hardware is provided.
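
The third property can be illustrated with a short Python sketch. It is only an illustration: the helper stride_of is a hypothetical name, indices stand in for memory addresses, and the gather is modeled with ordinary list indexing rather than with any particular gather/scatter hardware.

```python
def stride_of(addresses):
    """Return the constant stride if the addresses form an arithmetic
    progression, or None if they do not."""
    if len(addresses) < 2:
        return 0
    stride = addresses[1] - addresses[0]
    for previous, current in zip(addresses, addresses[1:]):
        if current - previous != stride:
            return None
    return stride

n = 6

# A(I) for I = 1..N: successive indices differ by a constant.
unit_stride = list(range(1, n + 1))

# B(J) with J = J*2 on every iteration (J starts at 1): a geometric progression.
j = 1
geometric = []
for _ in range(n):
    j = j * 2
    geometric.append(j)

print(stride_of(unit_stride))   # constant stride, so an adder can generate it
print(stride_of(geometric))     # None: addition alone cannot generate it

# A(B(I)): the index vector B is itself read with a constant stride and then
# used to gather arbitrary elements of A (0-based indices in this sketch).
a = [10, 20, 30, 40, 50]
b = [3, 0, 4, 1]
gathered = [a[k] for k in b]
print(gathered)
```

The indices held in B are irregular, yet B itself is read with a constant stride, which is why the nested reference remains vectorizable when gather/scatter hardware is available.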

Property  4:  No  Self-dependent  Statements 

Finally, the fourth property of a vectorizable loop is that any statement that
contains an aggregate variable cannot depend on itself. Although it is not immediately obvious,
the reason for this property is that vector instructions do not interleave the execution of
their operations. For example, two vector instructions that use the same functional unit
execute serially rather than alternating the execution of their operations. At the
beginning of this chapter (in Section 2.1), I stated that any architecture that [...]
parallelism must deal with data dependences. This fourth property shows the relationship
between dependences and a vector architecture.

Figure 2.6 shows how a vectorizable statement can directly or indirectly depend
on itself.[3] Direct self-dependence is often simple enough to vectorize with either special
hardware or a major software transformation. In addition, there are other special cases of
self-dependence that can be vectorized; I will give more details about these shortly. More
[...]

[3] Statements involving only scalar variables, such as A=A+5, are also self-dependent. However, because
scalar self-dependences do not prevent vectorization, I exclude them from this discussion. For the sake of
brevity, I use the term self-dependent statement to mean one that involves aggregate variables.

I make a distinction between a statement and an instance of a statement: one iteration of a
loop contains one instance of each statement. For example, for the following loop

      DO 10 I=1,N
    5 A(I)=B(I)*C(I)
   10 B(I+2)=A(I)+T

A(2)=B(2)*C(2) is the second instance of the statement S5. [...]
order specified by a self-dependent [...]
to the same memory location occur. For instance, for the preceding loop, the following table
summarizes the order of accesses to the two memory locations A(1) and B(3).

MEMORY LOCATION    FIRST REFERENCE          SECOND REFERENCE
A(1)               first instance of S5     first instance of S10
B(3)               first instance of S10    third instance of S5

[...] dependent statements, S5 and S10, [...]
interleaved fashion. In other words, some instance of S5 must execute before some instance
[...].

However, using vector instructions to execute self-dependent statements in a loop
[...] of a statement, the serial execution [...] statement. [...]
cannot be used to execute any statement that indirectly depends on itself, [...]
that directly depends upon itself is often a special case that can be vectorized. For instance,
the following loop

      DO 10 I=1,N
   10 A=A+B(I)

[...] of a scalar reduction, which is a function that reduces a vector of data to a
scalar value; other examples of scalar reductions are minimum and maximum operations.
NO SELF-DEPENDENT STATEMENTS

      DO 10 I=1,N
    8 A(I+1)=B(I)+X(I)
    9 B(I)=A(I)*S
   10 C(I)=C(I)*Y(I)

DIRECTLY SELF-DEPENDENT STATEMENT

      DO 20 I=2,N
   20 A(I)=A(I-1)+X(I)

INDIRECTLY SELF-DEPENDENT STATEMENTS

      DO 30 I=2,N
   28 A(I)=B(I)+X(I)
   29 C(I)=A(I)*S
   30 B(I+1)=C(I)*Y(I)

Figure 2.6: Self-dependent Statements

These loops contain examples of self-dependent statements. Recall from Section 2.1 that a
dependence specifies the order in which two references that access the same memory location must
execute in order to guarantee correctness. To more easily recognize when self-dependence occurs, a
dependence is associated with the statements that contain the two references. It is also necessary to
distinguish between instances of a statement; one iteration of a loop contains one instance of each
statement in the loop. A statement is self-dependent if instances of that statement must be executed
in succession for the associated program to execute correctly. Moreover, a statement can be either
directly or indirectly self-dependent.

A statement Sw directly depends on another statement Sv if Sw and Sv contain references that
access a common storage location and Sw accesses the location after Sv does. For example, in the
loop labeled 10, statement 9 directly depends on statement 8 because statement 9 accesses A(2)
after statement 8 does. This order of access, in fact, is true for all elements of the arrays A and B. A
statement is directly self-dependent if it contains references that access the same location but in
different iterations: to maintain the order specified by the dependences among these
references, the instances of the statement must execute in succession. For example, the statement in
the loop labeled 20 is self-dependent because the two references A(I) and A(I-1) access the same
memory locations, albeit in different iterations. On the other hand, statement 10 in the loop labeled
10 is not self-dependent even though it contains two references to C(I).

Dependence and self-dependence are also transitive. A statement Sw indirectly depends on
another statement Sv if Sw depends, directly or indirectly, on a statement which in turn directly or
indirectly depends on Sv. For example, in the loop labeled 30, statement 30 indirectly depends
on statement 28 because statement 30 directly depends on statement 29 because of the references
to C(I), and because statement 29, in turn, depends on statement 28 because of the references to
A(I). Note that indirectly dependent statements do not necessarily access the same memory
location. A statement Sw is indirectly self-dependent if Sw depends, directly or indirectly, on another
statement Sv which in turn directly or indirectly depends on Sw. Moreover, any statement, such
as Sv, that is part of these indirect dependences is also indirectly self-dependent. For example, in
the loop labeled 30, statement 28 indirectly depends on itself because it directly depends on
statement 30 because of the references B(I+1) and B(I), and because statement 30 indirectly depends
on statement 28. In fact, each statement in this loop is self-dependent.

The presence of a self-dependence is signaled by a dependence whose references occur in different
iterations. Such a dependence is called a loop-carried one. In contrast, a dependence whose references
occur in the same iteration is called loop-independent. Although a self-dependence contains at least
one loop-carried dependence, not all loop-carried dependences are part of a self-dependence. For
example, the loop labeled 10 has a loop-carried dependence but no self-dependent statements.
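
The definitions in Figure 2.6 amount to finding cycles in a graph of statement dependences. The Python sketch below is a simplified illustration and not the algorithm of any particular vectorizing compiler: it assumes one-dimensional references of the form X(I+offset), assumes the loop runs for enough iterations that every dependence distance actually occurs, and omits read-only scalars. The function names and the tuple encoding of statements are hypothetical.

```python
from itertools import combinations

def dependence_edges(statements):
    """statements: list of (label, write, reads), where each access is
    (array, offset) and stands for array(I+offset).
    Returns edges (u, v) meaning statement v depends on statement u."""
    accesses = []
    for position, (label, write, reads) in enumerate(statements):
        accesses.append((position, label, write[0], write[1], True))
        for array, offset in reads:
            accesses.append((position, label, array, offset, False))
    edges = set()
    for first, second in combinations(accesses, 2):
        pos1, label1, array1, offset1, is_write1 = first
        pos2, label2, array2, offset2, is_write2 = second
        if array1 != array2 or not (is_write1 or is_write2):
            continue
        # The same location is touched `distance` iterations later by `second`.
        distance = offset1 - offset2
        if distance > 0 or (distance == 0 and pos1 < pos2):
            edges.add((label1, label2))
        elif distance < 0 or (distance == 0 and pos1 > pos2):
            edges.add((label2, label1))
    return edges

def self_dependent(statements):
    """Map each self-dependent statement label to 'direct' or 'indirect'."""
    edges = dependence_edges(statements)
    successors = {}
    for source, target in edges:
        successors.setdefault(source, set()).add(target)
    result = {}
    for label, _, _ in statements:
        seen, stack = set(), list(successors.get(label, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(successors.get(node, ()))
        if label in seen:
            result[label] = "direct" if (label, label) in edges else "indirect"
    return result

# The three loops of Figure 2.6.
loop10 = [(8, ("A", 1), [("B", 0), ("X", 0)]),
          (9, ("B", 0), [("A", 0)]),
          (10, ("C", 0), [("C", 0), ("Y", 0)])]
loop20 = [(20, ("A", 0), [("A", -1), ("X", 0)])]
loop30 = [(28, ("A", 0), [("B", 0), ("X", 0)]),
          (29, ("C", 0), [("A", 0)]),
          (30, ("B", 1), [("C", 0), ("Y", 0)])]

for loop in (loop10, loop20, loop30):
    print(self_dependent(loop))
```

Under these assumptions the loop labeled 10 yields no self-dependent statements even though it has a loop-carried dependence on A, the loop labeled 20 yields a direct self-dependence, and every statement in the loop labeled 30 lies on the same dependence cycle.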


SOURCE CODE

      DO 10 I=1,N
      A(I)=B(I)+C(I)
   10 B(I)=A(I)+D

EXECUTION ORDER

SCALAR VERSION        VECTOR VERSION
A(1)=B(1)+C(1)        A(1)=B(1)+C(1)
B(1)=A(1)+D           A(2)=B(2)+C(2)
A(2)=B(2)+C(2)        A(3)=B(3)+C(3)
B(2)=A(2)+D           B(1)=A(1)+D
A(3)=B(3)+C(3)        B(2)=A(2)+D
B(3)=A(3)+D           B(3)=A(3)+D

Figure 2.7: Execution Orders When Using Scalar and Vector Instructions

[...] when using [...] in the loop [...] before executing the next
iteration of a loop (ignoring scalar optimization techniques such as loop unrolling or software
pipelining). The difference in execution [...] easiest to see when two vector
instructions use the same functional [...] of successive operations of a vector
instruction [...] a statement before executing all instances of the next statement.
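
The two execution orders of Figure 2.7 can be compared directly in software. The Python sketch below is an illustration only: the parameter k is a hypothetical generalization in which the second statement writes B(I+k), so that k = 0 reproduces the loop of Figure 2.7 and k = 2 produces a loop whose two statements depend on each other across iterations.

```python
def run_scalar(b, c, d, n, k):
    """Scalar order: both statements of iteration I execute before iteration I+1."""
    b = b[:]
    a = [0] * n
    for i in range(n):
        a[i] = b[i] + c[i]      # A(I) = B(I) + C(I)
        b[i + k] = a[i] + d     # B(I+k) = A(I) + D
    return a, b

def run_vector(b, c, d, n, k):
    """Vector order: all instances of the first statement execute before
    any instance of the second statement."""
    b = b[:]
    a = [0] * n
    for i in range(n):
        a[i] = b[i] + c[i]
    for i in range(n):
        b[i + k] = a[i] + d
    return a, b

n = 4
b = [1] * (n + 2)   # two extra elements so that B(I+2) stays in range
c = [1] * n
d = 1

# k = 0: no self-dependent statements, so both orders agree.
print(run_scalar(b, c, d, n, 0) == run_vector(b, c, d, n, 0))

# k = 2: the statements form a dependence cycle, so the orders disagree.
print(run_scalar(b, c, d, n, 2))
print(run_vector(b, c, d, n, 2))
```

With k = 2 the scalar order lets the third iteration read the value of B written by the first, whereas the vector order reads only the original values of B. This is the sense in which self-dependent statements require interleaved execution.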
Either hardware or software techniques can be used to vectorize such functions. As an
example of the former, the IBM System/370 vector architecture includes the
operations accumulate, minimum, and maximum in its instruction set [16]. The Cray-1 also
provides some hardware support for computing scalar reductions as an unexpected benefit
of using single-ported vector registers [65]. Alternatively, a scalar reduction can be vectorized
using an algebraic transformation in software without any special hardware support [92,
page 382]. For example, a summation can be vectorized as follows (the notation for vector
instructions is explained in Figure 2.2 on page 10):

                              R1 <- address of A(1)
                              R2 <- address of A(1) + N
      DO 10 I=1,N       TOP:  V1 <- M[R1]
10    A=A+B(I)                V0 <- V0 + V1
                              R1 <- R1 + 8
                              BRNE R1,R2,TOP

The arithmetic laws of commutativity and associativity are used to separate the sum into
n partial sums that can be vectorized, where n < N. Because floating-point arithmetic is
not associative, this transformation can compute a value that is different from what the
original code computes. The transformed code on the right-hand side is the vectorized
version, where the vector length is 8, the vector register V0 contains the n = 8 partial sums,
and N is assumed to be a multiple of 8. The final sum is obtained by using scalar code to
add the partial sums that are stored in V0.
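
The partial-sum transformation can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, a vector register is modeled as a Python list of 8 elements, and N is assumed to be a multiple of 8, as in the text.

```python
VECTOR_LENGTH = 8  # vector length assumed in the text


def scalar_sum(b):
    """Original loop: A = A + B(I), one element per iteration."""
    total = 0
    for x in b:
        total = total + x
    return total


def vectorized_sum(b):
    """Accumulate 8 partial sums strip by strip, then reduce them with scalar code."""
    assert len(b) % VECTOR_LENGTH == 0, "N is assumed to be a multiple of 8"
    v0 = [0] * VECTOR_LENGTH                     # V0 holds the partial sums
    for start in range(0, len(b), VECTOR_LENGTH):
        v1 = b[start:start + VECTOR_LENGTH]      # V1 <- M[R1]
        v0 = [p + x for p, x in zip(v0, v1)]     # V0 <- V0 + V1
    return scalar_sum(v0)                        # final scalar reduction
```

With integer data both versions agree exactly. With floating-point data the two results can differ in the low-order bits, because the additions are performed in a different order.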

Another example of a direct self-dependence is a first-order linear recurrence:

      DO 20 I=1,N
20    A(I)=A(I-1)*B(I) + C(I)

This function is given its name because the expression on the right-hand side is a linear
function that uses a value from the previous iteration. By extension, an nth-order linear
recurrence uses a value from the nth previous iteration. Again, either hardware or software
techniques can be used to vectorize such functions. The Hitachi S-810 has a
macro-vector instruction called VITR that executes first-order linear recurrences [114].
Alternatively, an algebraic transformation similar to the one for scalar reductions can vectorize
such functions using basic pair-wise vector instructions [103].
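
One well-known algebraic transformation of this kind is recursive doubling, sketched below in Python. Whether this is the exact transformation of [103] is an assumption; the function names are hypothetical. Each iteration is treated as a linear map x -> mul*x + add, and maps are composed pair-wise with a stride that doubles on every pass, so each pass consists only of element-wise multiplies and adds.

```python
def recurrence_scalar(a0, b, c):
    """Reference loop: A(I) = A(I-1)*B(I) + C(I), starting from A(0) = a0."""
    out, prev = [], a0
    for bi, ci in zip(b, c):
        prev = prev * bi + ci
        out.append(prev)
    return out


def recurrence_doubling(a0, b, c):
    """Same result using log2(N) passes of element-wise (vectorizable) operations."""
    n = len(b)
    mul, add = list(b), list(c)   # element i holds the map x -> mul[i]*x + add[i]
    stride = 1
    while stride < n:
        # Compose map i with map i-stride; elements below the stride are complete.
        new_mul = mul[:stride] + [mul[i] * mul[i - stride] for i in range(stride, n)]
        new_add = add[:stride] + [mul[i] * add[i - stride] + add[i]
                                  for i in range(stride, n)]
        mul, add = new_mul, new_add
        stride *= 2
    return [m * a0 + s for m, s in zip(mul, add)]
```

The doubling version performs more arithmetic in total than the scalar loop, and with floating-point data it can round differently, which is the same trade-off noted for scalar reductions.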

In addition to the above self-dependences, there are other special ones that can be
vectorized. For example, the following anti-dependence

      DO 10 I=1,N
10    A(I)=A(I+1)*B(I)

forms a self-dependence that can be vectorized because, on most modern vector implementations,
the fetch of the A elements will occur before the store. Another self-dependence,
consisting of both a flow dependence and an anti-dependence on the B elements, can be
vectorized by separating the loop into two parts, one with only the flow dependence and
the other with only the anti-dependence:
                                        DO 5 I=1,N/2
                                           A(I)=B(N-I+1)+C(I)
      DO 10 I=1,N                 5        B(I)=D(I)*T
         A(I)=B(N-I+1)+C(I)             DO 10 I=N/2+1,N
10       B(I)=D(I)*T                       A(I)=B(N-I+1)+C(I)
                                  10       B(I)=D(I)*T
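
The effect of splitting the loop can be checked with a small Python model. This is a sketch under assumptions: the function names are hypothetical, indices are 0-based, and "vector order" is modeled as executing every instance of the first statement before any instance of the second.

```python
def scalar_loop(b, c, d, t):
    """Reference order: both statements of iteration i run before iteration i+1."""
    n = len(b)
    a, b = [0] * n, list(b)
    for i in range(n):
        a[i] = b[n - 1 - i] + c[i]   # A(I)=B(N-I+1)+C(I)
        b[i] = d[i] * t              # B(I)=D(I)*T
    return a, b


def vector_strip(a, b, c, d, t, lo, hi):
    """Vector order: all instances of statement 1, then all instances of statement 2."""
    n = len(b)
    loaded = [b[n - 1 - i] for i in range(lo, hi)]
    a[lo:hi] = [x + c[i] for x, i in zip(loaded, range(lo, hi))]
    b[lo:hi] = [d[i] * t for i in range(lo, hi)]


def vector_loop(b, c, d, t, split):
    """Run the loop in vector order, either as one loop or split at N/2."""
    n = len(b)
    a, b = [0] * n, list(b)
    bounds = [0, n // 2, n] if split else [0, n]
    for lo, hi in zip(bounds, bounds[1:]):
        vector_strip(a, b, c, d, t, lo, hi)
    return a, b
```

In this model the split version matches the scalar loop, whereas vectorizing the original loop as a whole reads stale B values in the second half of the iterations.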

As a last example, a major software transformation vectorizes array references with nested
indices:

      DO 10 I=1,N
10    A(K(I)) = A(K(I)) + B(I)

Because the index values for the A references are not known at compilation time, this
statement must be assumed to be self-dependent. Nonetheless, using gather/scatter memory
instructions, the Cray Research compiler, cft77 version 5.0, is able to vectorize such a loop
and still preserve any dependences that may exist. Michael Wolfe describes techniques
for vectorizing in the presence of certain self-dependences [120, pages 64-67 of Chapter 3].
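
The reason such a statement must be assumed to be self-dependent can be seen in a small Python model. This sketch does not show how cft77 preserves the dependences; it only shows, with hypothetical function names, that a plain gather, add, scatter sequence agrees with the scalar loop when K contains no repeated values and can disagree when it does.

```python
def scalar_update(a, k, b):
    """Reference loop: A(K(I)) = A(K(I)) + B(I), one iteration at a time."""
    a = list(a)
    for i in range(len(k)):
        a[k[i]] = a[k[i]] + b[i]
    return a


def gather_scatter_update(a, k, b):
    """Naive vector version: gather all A(K(I)), add B, then scatter the results."""
    a = list(a)
    gathered = [a[idx] for idx in k]                 # gather
    summed = [g + x for g, x in zip(gathered, b)]    # vector add
    for idx, value in zip(k, summed):                # scatter (last write wins)
        a[idx] = value
    return a
```

With k = [0, 1, 0] the scalar loop accumulates both contributions into the first element, while the naive vector version keeps only the last one.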
Except for these special types of self-dependences that can be
vectorized, a self-dependent statement normally prevents vectorization because the dependences
force instances of statements from different iterations to execute in an interleaved
fashion. The absence of self-dependent statements imposes no restrictions on the order in
which statements from different iterations can execute; only those from the same iteration
must execute according to a partial ordering based on intra-iteration dependences. This
results in statements that can be ordered in such a way that all dependences will be preserved
when vector instructions are used.

2.3.2  Generating  Vectorized  Code 

Whereas vector hardware determines the properties of a vectorizable program
fragment, a vectorizing compiler is responsible for identifying these properties and then
generating the appropriate mixture of vector and scalar instructions that will execute a
vectorizable program fragment. Using this mixture of vector and scalar instructions to
execute a loop is conceptually similar to using loop unrolling, an optimization technique used
in scalar compilation. To highlight this similarity, I outline in the following how
a vectorizing compiler identifies a vectorizable program fragment and generates vectorized
code.

For the following discussion, I distinguish between vectorizable and non-vectorizable
operations: the former are translated into vector instructions and the latter into scalar
instructions. In a vectorizable loop, operations that either have an aggregate variable as
input or generate an aggregate variable as output are vectorizable operations; all other
operations are non-vectorizable ones. For example, the loads, multiplication, and store in the
statement A(I)=B(I)*C(I) are vectorizable operations, whereas the three additions for
address calculations are non-vectorizable ones. Other examples of non-vectorizable operations
include loop-index calculations, branch comparisons, and explicit, scalar self-dependences
such as [...].

Identifying a vectorizable program fragment is a simple matter of identifying a
program fragment that has the four properties described in the previous subsection. The
first three properties are easily identified. Loops can be identified by the semantics of a
language. For example, in FORTRAN, the DO construct signifies a loop whereas in C, the
for and while constructs do. Because other constructs, such as if ... goto, can also form
a loop, a more semantic-independent methodology that is based on flow graphs can be used
to identify loops. A flow graph, which is normally constructed for scalar optimization, is a
directed graph that represents the control flow of a program; a loop is merely a sub-graph
with a special structure within a flow graph. Aho, Sethi, and Ullman give algorithms for
constructing flow graphs and identifying loops in them [2, Chapter 10]. The second property,
the presence of aggregate variables, is easily determined by examining the type of a variable;
that is, whether a variable is a scalar, an array, or a pointer. The third property, memory
accesses that form an arithmetic sequence, can be determined by examining the use-def
chains, which are built for scalar analysis [4, 2], to identify variables (typically indices of
arrays) that are used to compute the addresses of an aggregate variable and check that their
computations involve only additions or subtractions. For the purposes of code generation,
this analysis can also extract more information, such as the value of memory offsets, about
the access pattern of an aggregate variable.

Identifying the fourth property, the absence of self-dependent statements, proceeds
in two steps. The first is to construct a dependence graph, which is a directed graph that
represents the dependence relations of a program. In such a graph, a vertex is a statement
or operation depending upon the level of detail desired, and there is an edge from one
statement to another if the second statement accesses a common memory location after the
first statement does. Because a dependence graph is used to identify vectorizable loops,
only vectorizable statements or operations are represented in the dependence graph for a
vectorizing compiler. For example, in the following loop

      DO 10 I=1,N
5        B=B+21
10       A(I)=A(I)*8

statement S10 would be included in the dependence graph but statement S5 would not.
Similarly, the operations for loop overhead and address calculation would not be part of the
graph.

Constructing a dependence graph, or more specifically, identifying dependences
that exist among vectorizable statements, is not trivial. Unlike the other functions I have
described above for a vectorizing compiler, identifying dependences, a process also known as
dependence analysis, is necessary for any compiler that is generating parallel code because
knowing where the dependences are is the key to correct functionality. Consequently, much
research, past and present, has been spent in this area not only for vectorizing compilers but
also for more general, parallelizing compilers. There are two major methods for detecting
dependences. The more established method operates only on arrays. Because an array
has an explicit addressing mechanism, detecting a dependence between two statements
is a matter of solving an algebraic equation constructed from the address functions of
the associated array references [120, 5, 83]. A newer, graph-theoretic technique, which is
based on data-flow analysis, is aimed at linked-list data structures that use pointers with
implicit addresses [79, 56]. Because the early vector implementations were used for scientific
programs that modeled physical phenomena in a discrete fashion that is a good match for
the array data structure, vectorizing compilers typically use the algebraic method.
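
As an illustration of the algebraic method, the classic GCD test is sketched below in Python. It is one simple instance of solving an equation built from address functions, chosen here for brevity; the text does not say which tests the cited work uses, and the function name is hypothetical. For two references X(a*i+b) and X(c*j+d), the equation a*i + b = c*j + d has an integer solution only if gcd(a, c) divides d - b. The test ignores loop bounds, so a result of True means only that a dependence cannot be ruled out.

```python
from math import gcd


def gcd_test(a, b, c, d):
    """Return False when X(a*i+b) and X(c*j+d) can never name the same element.

    True means a dependence is possible (loop bounds are not considered).
    """
    g = gcd(a, c)
    if g == 0:               # both subscripts are constants
        return b == d
    return (d - b) % g == 0
```

For example, references X(2*I) and X(2*I+1) touch only even and only odd elements, so the test reports that no dependence exists, whereas X(I) and X(I+1) may overlap.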

Two critical aspects of any dependence analyzer are efficiency and accuracy. [...]
Furthermore, although efficiency is [...], [...] have shown that a suite of selected
array-based dependence analysis [...] can accurately identify all dependences in the
PERFECT Club benchmark set of 13 scientific programs and add only an average of
about 3% to [...].

Once a dependence graph is constructed, the second step to identifying self-dependent
statements is to determine whether there is a path from any statement to itself. Any
[...] is called a dependence cycle because it forms a [...] in the dependence graph, and
the statements on that path are said to form a dependence [...]. A vectorizable
loop is one whose dependence graph has no [...]; an acyclic dependence
graph is also known as a dag, an acronym for directed acyclic graph. When compared with
constructing a dependence graph, finding cycles is [...] [120, page 57].
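
Finding the statements that lie on a dependence cycle can be sketched as a reachability search in Python. This is a minimal illustration, not the algorithm of the cited work: the function name is hypothetical and the dependence graph is assumed to be given as a list of (source, sink) statement pairs.

```python
def self_dependent_statements(edges):
    """Return the set of statements that can reach themselves in the graph."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())

    def reaches_itself(start):
        stack, seen = list(graph[start]), set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(graph[node])
        return False

    return {stmt for stmt in graph if reaches_itself(stmt)}
```

A loop whose graph yields an empty set has an acyclic dependence graph. A production compiler would more likely use a single strongly-connected-components pass than one search per statement; the per-statement search is used here only because it is short.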

[...] value produced by an operation. [...] represent the functionality of the vectorizable
portion of a program [...]. Although these tasks are also performed [...] time of a loop
[...] because execution of [...] functionality but more for performance reasons. [...]
dependences reduce the opportunity to schedule [...] parallelism. Reassignment of the
registers will help alleviate this restriction. In the Cray Research compiler, scheduling is
done before register assignment, but the scheduler attempts to minimize register usage.

Although these two tasks traditionally have been performed separately, research
in scalar algorithms is examining techniques that allow for interaction between these
functions [50, 15]. Based on information from the register assigner, the scheduler alternates
between two scheduling strategies. The initial goal of the scheduling algorithm is to
minimize execution time. To avoid exceeding the register limitation of the processor, the register
assigner is used to inform the scheduler about the register usage of its schedule while the
schedule is being constructed. Once a threshold has been exceeded, the scheduler changes
strategies to reduce register usage until registers are no longer a critical resource, at which
point the scheduler switches back to the initial strategy. Such a technique is being adapted
in future versions of the Cray Research compiler.

In addition to scheduling and assigning, a vectorizing compiler translates
operations into vector and scalar instructions. However, whereas the translation from a
vectorizable operation to a vector instruction is straightforward, the translation from a
non-vectorizable operation to a scalar instruction requires some explanation. First, it should
be noted that a vectorizable operation produces a different result every iteration and is
translated into a vector instruction that executes n instances of that operation. As with
a vectorizable operation, a non-vectorizable one that is self-dependent, such as A=A+1 and
X=X+21, produces a different result every iteration and is translated into one or more scalar
instructions that emulate the execution of n iterations of that self-dependence. The
execution of n iterations of a scalar self-dependence can be expressed as a function of n that
requires fewer instructions to execute than n instances of the corresponding scalar
instruction. For example, the scalar instruction for executing the operation A=A+1 n times is
just R1<-R1+R0, where register R1 holds the value of A and register R0 holds the value n. A
slightly more complex operation, such as X=X+21, requires two scalar instructions to emulate
executing it n times: R2<-21*R0 and R1<-R1+R2.
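
The equivalence between n executions of X=X+21 and the two-instruction sequence can be checked with a short Python sketch. The function names are hypothetical, and the local variables r0, r1, and r2 simply mirror the registers named in the text.

```python
def run_iterations(x, n):
    """Execute X = X + 21 once per iteration, n times."""
    for _ in range(n):
        x = x + 21
    return x


def closed_form(x, n):
    """Emulate n iterations with two scalar instructions."""
    r0 = n          # R0 holds the number of iterations
    r1 = x          # R1 holds the value of X
    r2 = 21 * r0    # R2 <- 21*R0
    r1 = r1 + r2    # R1 <- R1+R2
    return r1
```

For a strip of 64 iterations, the closed form replaces 64 additions with one multiplication and one addition.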

A  technique  called  stripmining  generates  code  to  execute  a  loop  in  which  the 
number  of  iterations  executed  exceeds  the  number  of  registers  in  a  vector  register.  To  reduce 
the  number  of  parameters  in  the  following  discussion,  I  assume  the  number  of  registers  in 
a  vector  register  to  be  64  (based  on  the  Cray  Y-MP).  Stripmining  executes  such  a  loop  in 
strips  where  each  strip  executes  64  or  fewer  iterations.  When  the  number  of  iterations  N 
is  not  a  multiple  of  64,  one  of  the  strips  executes  N  mod  64  iterations. 

The  easiest  type  of  loop  to  stripmine  is  one  in  which  the  value  of  N  is  known 
without  having  to  execute  the  loop,  such  as: 


      DO 10 I=1,N
10    A(I)=0


There  are  two  ways  to  implement  the  code  that  controls  the  execution  of  the  strips  for  such 
a  loop.  In  a  software-oriented  approach,  which  is  used  in  the  Cray  vector  implementations, 
the  strip  that  executes  N  mod  64  iterations  is  executed  first. 




      <scalar code to calculate N mod 64>
      VL <- 64
      V0 <- 0
      VL <- N mod 64          ; assign length of first strip
      R1 <- address of A(1)
      R2 <- N
TOP:  M[R1] <- V0             ; store 0 into elements of array
      R1 <- R1+VL             ; update address
      R2 <- R2-VL             ; update branch counter
      VL <- 64                ; update strip length
      BRNZ R2,TOP


Because  64  is  a  power  of  2,  the  value  N  mod  64  can  be  computed  by  using  simple  shift  and 
mask  operations  and  no  divide  or  remainder  calculations.  Alternatively,  for  a  hardware- 
only  method,  which  is  used  in  the  IBM  System/370  vector  architecture  [16],  the  strip  that 
executes  N  mod  64  iterations  is  executed  last.  A  special  instruction,  VLVCU  (Load  Vector 
Count  and  Update),  first  subtracts  the  number  of  iterations  executed  for  a  strip  from  a 
register  that  holds  the  number  of  remaining  iterations  to  be  executed,  and  then  it  sets  the 
condition  code  to  indicate  whether  or  not  the  difference  is  greater  than  or  equal  to  zero. 
Because  the  number  64,  which  is  the  number  of  registers  in  a  vector  register,  does  not  need 
to  appear  in  the  generated  code,  implementing  stripmining  in  this  fashion  allows  processors 
with  different  vector-register  lengths  to  be  binary  compatible. 
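
The software-oriented approach, in which the short strip is executed first, can be sketched in Python as follows. This is an illustration under assumptions: the function names are hypothetical, a strip is modeled as a (start, length) pair, and an empty first strip is simply skipped when N is a multiple of 64. How a particular machine treats a zero vector length is not modeled here.

```python
VECTOR_LENGTH = 64  # elements per vector register, as assumed in the text


def stripmine(n, vector_length=VECTOR_LENGTH):
    """Return (start, length) strips covering n iterations, short strip first."""
    # N mod 64 computed with a mask, which is valid because 64 is a power of 2.
    first = n & (vector_length - 1)
    strips, start = [], 0
    if first:
        strips.append((0, first))
        start = first
    while start < n:
        strips.append((start, vector_length))
        start += vector_length
    return strips


def zero_array(a):
    """Stripmined form of: DO 10 I=1,N / 10 A(I)=0."""
    for start, length in stripmine(len(a)):
        a[start:start + length] = [0] * length   # one vector store per strip
    return a
```

For N = 130 the strips have lengths 2, 64, and 64, so only the first strip needs a vector length other than 64.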

Stripmining can be used even when the value of N can be determined only by
executing the loop. For example, because there are no self-dependent statements, the
following loop can be vectorized despite the potential early exit:

      DO 10 I=1,N
         A(I) = B(I)*C(I) + 4/D(I)
         if ( A(I).eq.S(I) ) goto 20
10    CONTINUE
20    ...

However, the resultant stripmined code becomes more complex and involves the possible
execution of unnecessary operations. Because of the complexity and variability in
performance, few vectorizing compilers attempt to vectorize such loops. For the interested reader,
Wolfe describes techniques for vectorizing such loops [120, Chapter 10].

Stripmining  is  conceptually  similar  to  the  scalar  optimization  technique  called  loop 
unrolling  whereby  the  body  of  a  loop  is  replicated  n  times  and  n  is  called  the  amount  of 
unrolling.  A  strip  that  executes  n  iterations  of  a  loop  is  analogous  to  a  loop  body  that 
has  been  unrolled  n  times.  One  of  the  benefits  of  loop  unrolling  is  that  more  independent 
operations  are  available  to  execute  in  parallel.  However,  in  order  to  correctly  execute 
these  operations,  accurate  information  about  data  dependence  is  needed  as  is  the  case  for 
vectorization  and  stripmining.  Moreover,  optimizations  applied  to  an  unrolled  loop  body  to 
eliminate  redundant  computations  should  result  in  code  similar  to  that  produced  for  scalar 
instructions  in  a  stripmined  loop.  Based  on  this  analogy,  the  hardware-only  implementation 
of  stripmining  in  the  IBM  System/370  vector  architecture  can  be  considered  hardware 
support  for  loop  unrolling  where  the  amount  of  unrolling  is  set  by  the  hardware. 
Despite the similarities between stripmining and loop unrolling, the former has
advantages over the latter. For example, other than having to calculate N mod 64, no
extra code is needed to execute the N mod 64 iterations because the same code is used to
execute strips of any length. Moreover, this code is reasonably [...]
of unrolling. In contrast, the extra code required to execute the N mod 64 iterations for
loop unrolling can be either a non-optimal, rolled version of the loop or optimized, unrolled
versions for each residual value that is possible. Another advantage to stripmining is that the
number of instructions generated does not increase substantially over the number generated
for the rolled version because vector instructions are used. Hence, stripmining and the use
of vector instructions avoid the higher instruction bandwidth incurred by loop unrolling
while taking advantage of the same fine-grain parallelism that loop unrolling uses.


2.4  Summary 

In this chapter, I described how vector architectures support fine-grain parallelism
in both hardware and software, and I contrasted this architectural approach with three
others: superpipelined, superscalar, and VLIW. What follows is a summary of those aspects
that are common to any architecture that supports fine-grain parallelism and those that
are specific to a vector architecture.

A program's functionality must not change even though the operations of a
program are performed in a different order when using parallel instead of scalar execution.
A program's functionality, which is reflected by its output, will not change if the order
of accesses to each storage location does not change. Because data dependences stipulate
the order of accesses, preserving data dependences is the key to guaranteeing correct
functionality. As a result, any architecture that supports fine-grain parallelism must provide a
mechanism for handling data dependences to ensure correct functionality.

Dependences can occur in two different storage locations: registers or main
memory. Hardware mechanisms are typically used for resolving register dependences, examples
of which are register-renaming and data-forwarding in superscalar and superpipelined
architectures, and chaining and tailgating in vector architectures. In contrast, the compiler is
usually responsible for handling memory dependences. Because it is responsible for ensuring
correct functionality, a crucial component of such a compiler is its dependence analyzer
that identifies dependent operations. There are two major methods for detecting
dependences in software. The more established method operates only on arrays, which have an
explicit addressing mechanism. Dependences are detected by solving an algebraic equation
constructed from the address functions of array references.

In addition to handling dependences, any architecture that executes more than
one operation per clock period must be able to perform multiple instances of the basic
execution sequence: initiate operation, fetch operand(s), execute operation, and store result.
Whereas there are many techniques for performing multiple instances of each of these tasks,
only the first is specifically associated with a particular architectural approach: for example,
superpipelined architectures use longer pipelines to produce a faster clock; superscalar and
VLIW architectures use the obvious approach of simultaneously issuing multiple operations
from the instruction unit; and vector architectures use overlapped execution of multiple
vector instructions. The second and fourth tasks, fetching and storing multiple values in the
same clock period, can be accomplished by several register file organizations: monolithic,
partitioned, distributed, and combinations thereof. These organizations vary in their degree
of connectivity to the functional units, which creates a trade-off between accessibility and
hardware costs. Although there is no hard and fast rule for using a particular organization
with a specific architecture, superscalar architectures typically use a monolithic
configuration, vector architectures use a partitioned one, and VLIW architectures use combinations
of these organizations. Finally, all of the architectures perform the third task, executing
multiple operations, in a uniform fashion by using several functional units.

In fact, because only techniques for accomplishing the first task are associated
with a particular architectural class and techniques for accomplishing the remaining tasks
could be used by any architecture, the defining element of an architecture that can execute
more than one operation per clock period is how it initiates more than one operation per
clock period. For a vector architecture, the key characteristic is the vector instruction, one
that causes multiple operations to execute sequentially in the same functional unit. This
instruction provides a simple intuitive model for understanding how fine-grain parallelism
can be used by hardware and by a compiler.

Whereas all four architectural approaches are attempting to use the parallelism
present in a program, a vector architecture does so through the use of its vector instructions.
Fine-grain parallelism exists in two places: among operations from different basic blocks
and among operations within the same basic block. Parallelism across different basic blocks
is used by the vector instruction itself, which is essentially a compact form of loop unrolling.
This type of parallelism is also manifested when a vector instruction continues to initiate
operations after the next basic block begins executing. Such a situation will occur as long
as completely independent resources are used. Parallelism within the same basic block is
made use of through the overlapped execution of several vector instructions.

Although vector instructions use the fine-grain parallelism present in a program,
not all parts of a program can be executed with vector instructions because of how vector
instructions are implemented. A program fragment with the properties listed in Figure 2.8
can be vectorized. The presence of either non-aggregate variables or scalar self-dependences
does not prevent vectorization. A common example of a non-vectorizable loop is one that
performs pointer-chasing through a linked list; such a loop cannot be vectorized because the
statement that performs the pointer-chasing (p=p->next) contains an aggregate variable and
directly depends upon itself. A technique, which is conceptually similar to loop unrolling
and is called stripmining, is used to execute a vectorizable loop. Moreover, operations
that either have an aggregate variable as input or generate an aggregate variable as output
are translated into vector instructions; other operations, such as address calculation or
branch computation, are translated into scalar instructions in a manner similar to how this
translation is done when loop unrolling.

The  properties  of  a  vectorizable  program  fragment  do  not  include  any  characteristic 
of  a  loop’s  branch  computation,  which  determines  the  number  of  iterations,  or  loop  length, 
which  are  executed  for  a  particular  invocation  of  a  loop.  Although  it  may  seem  that  knowing 
the  length  of  a  loop  without  executing  it  is  necessary  for  vectorization,  this  is  not  true. 
The  easiest  type  of  loop  to  vectorize  is  certainly  one  whose  length  can  be  determined  in 


CHARACTERISTIC OF VECTOR              PROPERTY OF A VECTORIZABLE
HARDWARE                              PROGRAM FRAGMENT

a vector instruction specifies only   a program fragment is a loop in
one type of operation                 which statements are executed
                                      multiple times

each operation in a vector            a loop has at least one variable,
instruction accesses a different      called an aggregate variable, that
storage location                      accesses a different storage location
                                      on each iteration

a vector address-generator can only   the memory accesses of each
do addition                           aggregate variable form an arithmetic
                                      progression, which involves only
                                      addition

vector instructions do not            any statement containing an
interleave the execution of their     aggregate variable cannot depend on
operations                            itself

Figure 2.8: Properties of a Vectorizable Program Fragment

This  table  summarizes  the  relationship  between  a  characteristic  of  vector  hardware  and  the 
corresponding  property  of  a  vectorizable  program  fragment.  Details  explaining  these  relationships 
can  be  found  in  Section  2.3.1. 
[...] a FORTRAN DO loop. But a loop whose length [...]
execution of operations from different vector instructions [...]
vector instructions are an unrolled version of a vectorizable loop [...]
effective my algorithms are at using the hardware.
Chapter 3

A  Case  for  Vector  Architectures 


VLSI technology continues to improve at a phenomenal rate. The combination of
decreasing area for one transistor and increasing area for an entire chip has resulted in a
proliferation of transistors in a single chip. Since the introduction of the first microprocessor,
the Intel 4004 with 2300 transistors, in 1971, the number of transistors per chip has
consistently doubled every two years, resulting in an average yearly growth rate of about
1.4 [47]. In 1986, Myers, Yu and House predicted that a VLSI chip that has 10 million
transistors could be manufactured in 1995 [86]. As evidence that this prediction is well
within reason, the following table demonstrates the rapid growth in the transistor count of
CMOS processors since Myers et al. made their prediction:


YEAR    PROCESSOR NAME      NUMBER OF TRANSISTORS

1985    Intel i80386        275,000
1989    Intel i860          1,000,000
1991    Intel i860 XP       2,500,000
        Sun SuperSPARC      3,100,000

REFERENCE: [86], [73, 93], [61]

As shown by the transistor counts for 1989 and 1991, this growth rate is expected to increase
[...] memory, which is denser than processor logic, is [...].
Based on these growth trends, producing a ten-million transistor chip should be feasible by
1995.
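
As a rough check on these figures, the doubling rule quoted above can be evaluated directly. The Python sketch below is only arithmetic on numbers stated in the text (2300 transistors in 1971, doubling every two years); the function name is hypothetical, and the result is an extrapolation of that rule rather than a measured value.

```python
def projected_transistors(start_count, start_year, target_year,
                          doubling_period_years=2):
    """Extrapolate a transistor count that doubles every doubling_period_years."""
    doublings = (target_year - start_year) / doubling_period_years
    return start_count * 2 ** doublings


# Doubling every two years corresponds to a yearly growth factor of sqrt(2).
yearly_growth = 2 ** (1 / 2)
```

Twelve doublings between 1971 and 1995 give 2300 * 4096, or roughly 9.4 million transistors, which is in line with the ten-million figure.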

Such a large VLSI processor can be used in at least two types of computers. One
type of computer is the workstation, a high-performance desktop or deskside computer
with a graphical user interface. The workstation is targeted for scientific and engineering
applications, which is reflected by the fact that 6 of the 10 programs in the SPEC benchmark
suite come from this application domain [106]. The purpose of the SPEC benchmark suite is
to chronicle the performance of workstations. Programs for [...]
scientific/engineering can be characterized as being compute-intensive with a high demand
for [...]. The other type of computer in which a large VLSI processor is advantageous is one in which
100 or even 1000 or more processors are connected. Such a computer, called a massively
parallel processor or MPP, is designed to meet a hardware challenge of the 1990s.
As VLSI technology continues to improve, processor architects are considering
[...] than a vector architecture, which many designers mistakenly believe is inherently
costly to [...] that the majority of commercially successful [...]
for fabrication and exotic techniques for packaging and cooling [...]
correlation between high cost and a vector [...]. In fact, three companies
have recently implemented a vector architecture in VLSI:

•  Thinking Machines Corporation has extended the SPARC architecture with a vector
architecture for use in their MPP, the Connection Machine 5 [109];

•  NEC has implemented a single-chip vector processor with 693,000 transistors in 0.8 µm
BiCMOS technology using a clock frequency of 100 MHz [90]; and

•  Fujitsu has implemented a single-chip vector processor with [...] transistors in
0.5 µm CMOS technology using a clock frequency of 70 MHz [6 ].

Contrary to the beliefs of proponents of superscalar architectures, I believe that a [...]
to mitigate the effects of Amdahl’s Law, a combined vector and superpipelined architecture
can take advantage of what little parallelism there is in non-vectorizable programs. Finally,
I discuss some of the software advantages of a vector architecture.

3.1  Hardware  Advantages  of  Vector  Architectures 

One common misconception about vector architectures is that the hardware resources
they require are many times more expensive than those of superscalar architectures,
in particular the vector register file and the high-bandwidth memory system. In fact, as I
discuss in this section, the hardware resources required to implement a vector architecture
are comparable or, in some aspects, even less costly when supporting the same amount of
parallelism as a superscalar implementation. In particular,

•  the  number  of  functional  units  can  be  the  same; 

•  the  need  for  a  high  performance  memory  is  the  same; 

•  the  area  of  a  vector  register  file  is  comparable  to  that  of  a  superscalar  multiported 
register  file;  and 

•  the  instruction-issue  logic  of  a  vector  implementation  is  less  complicated. 

Moreover,  as  hardware  designers  increase  the  amount  of  parallelism  in  the  processor,  the 
cost  advantage  of  a  vector  architecture  over  a  superscalar  one  becomes  more  pronounced 
with  respect  to  the  register  file  and  issue  logic. 

3.1.1  Number  of  Functional  Units 

To  support  the  simultaneous  initiation  of  N  operations,  N  functional  units  must 
be  provided  in  either  vector  or  superscalar  architectures.  Furthermore,  either  one  will  need 
the  same  number  of  buses  to  deliver  operands  and  results  between  the  functional  units  and 
the  register  file  because  this  number  is  solely  dependent  on  the  number  of  functional  units 
provided.  Hence,  the  implementation  cost  of  functional  units  is  identical  for  superscalar 
and  vector  architectures  that  support  the  same  amount  of  fine-grain  parallelism. 

3.1.2  High-Performance  Memory  System 

Based on past implementations, computer designers mistakenly believe that a vector
processor must have a more expensive, high-bandwidth memory system than the one
required by a superscalar processor. Although this may be true historically, I do not believe
an expensive memory system is a fundamental requirement of a vector processor. Moreover,
I believe that a less costly memory system that is suitable for a superscalar architecture
should also be suitable for a vector architecture. This is because either one, and in fact
any processor architecture that supports fine-grain parallelism, will place a comparable
demand on memory bandwidth as a natural consequence of executing multiple operations per
clock period. Studies of instruction mixes show that memory operations make up 20-30%
of the instructions executed for a typical program [92]. In a scalar implementation with a
performance goal of one instruction per clock period, the memory demand for data is one
memory access every three to five clock periods. In a superscalar or vector implementation,
where multiple operations are executed per clock period, the demand for data can be as
frequent as a memory access every two to three clock periods or even every clock period. In
fact, with enough datapath parallelism in the hardware, the demand for data can be greater
than one memory access per cycle.
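
The arithmetic behind these figures can be checked with a short sketch. It assumes, as a simplification that is not part of the cited studies, that the 20-30% memory fraction stays constant as more operations are executed per clock period and that the stated operation rate is sustained. The function name and parameters are illustrative only.

```python
def cycles_between_accesses(mem_fraction, ops_per_cycle):
    """Average number of clock periods between data-memory accesses.

    mem_fraction  -- fraction of executed operations that access memory
    ops_per_cycle -- operations completed per clock period (assumed sustained)
    """
    accesses_per_cycle = mem_fraction * ops_per_cycle
    return 1.0 / accesses_per_cycle


# Sweep the datapath parallelism for the 20-30% range quoted in the text.
for ops in (1, 2, 4, 8):
    low = cycles_between_accesses(0.30, ops)
    high = cycles_between_accesses(0.20, ops)
    print(f"{ops} ops/cycle: one access every {low:.2f} to {high:.2f} clock periods")
```

With one operation per clock period the sketch gives roughly three to five clock periods between accesses, matching the scalar case above. Once the product of the memory fraction and the operation rate exceeds one, the interval drops below one clock period, which is the point at which more than one memory port becomes necessary.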

Because memory demand for data can be so high, multiple memory-ports will be
necessary with increasing datapath parallelism in either a vector or superscalar architecture.
Continually increasing the parallelism of the datapath without increasing the number
of memory ports will ultimately make memory the bottleneck to improved performance.
Memory ports are more expensive to add than floating-point units, however, because
increasing their numbers impacts the entire memory system. Although several vector
computers have implemented multiple memory-ports, superscalar architectures have yet to do
so.

Vector  computers  typically  use  a  large,  highly-interleaved  main  memory  built  from 
expensive  SRAM  chips.  In  contrast,  superscalar  computers  follow  their  scalar  ancestors  by 
using  cache-based  memory  systems  that  are  less  costly  and  reputedly  provide  sufficient 
performance.  But,  not  only  is  the  demand  for  memory  bandwidth  independent  of  the 
processor  architecture,  so  is  the  implementation  of  the  memory  system.  Consequently, 
although high-cost memory systems have routinely been used in vector computers, a cache-based
memory system could be used as a more cost-effective solution. For example, the
IBM  3090  has  a  64-Kbyte,  4-way  set-associative  cache  [113].  Furthermore,  research  into 
cache  designs  that  provide  high  memory-bandwidth  for  superscalar  architectures  [104]  could 
also  be  applied  to  vector  architectures. 

Cache-based  vector  computers  are  not  common  because  popular  wisdom  suggests 
that  scientific  and  engineering  programs,  which  are  most  suitable  for  a  vector  architecture, 
have  memory  reference  patterns  with  poor  spatial  and  temporal  locality.  Although  these 
characteristics result in poor cache performance, this performance has more to do with the
program itself than with the design of the processor. Accordingly, if a cache-based vector
computer exhibits poor cache performance, so will a cache-based superscalar computer when
executing the same program.

Evidence that the program is the major influence on cache behavior comes from
Clark and Wilson, who present performance data for the vector cache in the IBM 3090
[21]. To multiply two matrices of dimensions 300 x N and N x 100, where N is varied
from 50 to 600 by increments of 50, they use three different algorithms¹ to improve the cache
performance as measured by execution time. They find that for each algorithm, the cache
performance curves of both scalar and vector processing have the same shape. Moreover,
the straightforward algorithm for matrix multiply has declining cache performance for both
scalar and vector processing as the problem size increases, whereas two blocked algorithms
have cache behavior that is insensitive to the size of the problem. Rather than working on an
entire row or column of a matrix, such blocked algorithms rearrange computations to work
on submatrices or blocks that will fit in a cache [45, 77]. Such rearrangements are designed
not to affect the vectorizability of a program [29]. Hence, if a blocked algorithm performs
well on a superscalar architecture, it should also perform well on a vector architecture.

¹ I use the words program and algorithm interchangeably.
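
To make the idea of a blocked algorithm concrete, the following is a minimal sketch of a straightforward and a blocked matrix multiply. It is not the code used by Clark and Wilson; the function names, the block size, and the loop order are assumptions chosen for illustration. The innermost loop of the blocked version still walks a row with unit stride and carries no self-dependence, which is the property that keeps it vectorizable.

```python
def matmul_naive(a, b):
    """Straightforward triple loop: C = A x B for lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_blocked(a, b, block=4):
    """Blocked multiply: work on block x block submatrices at a time.

    The block size is a hypothetical tuning parameter; in practice it would
    be chosen so that the submatrices in use fit in the cache.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for p0 in range(0, k, block):
                for i in range(i0, min(i0 + block, n)):
                    row_c = c[i]
                    for p in range(p0, min(p0 + block, k)):
                        a_ip = a[i][p]
                        row_b = b[p]
                        # Unit-stride loop with no self-dependence:
                        # each row_c[j] is updated independently.
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += a_ip * row_b[j]
    return c
```

Both functions accumulate the products for each element of the result in the same order, so they return identical matrices; only the order in which memory is touched differs.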

Although  memory  implementation  and  memory  demand  are  independent  of  the 
processor  architecture,  a  vector  architecture  has  several  features  that  can  simplify  the  design 
of a memory system: less memory traffic than in a superscalar architecture, a built-in
mechanism for prefetching data, and fewer wires in the memory interface than what is needed
for  a  superscalar  architecture.  The  last  feature  is  especially  advantageous  when  multiple 
memory-ports  are  implemented.  In  the  following  paragraphs,  I  qualitatively  describe  the 
advantages  of  these  features,  leaving  a  quantitative  analysis  for  a  future  study. 

First,  whenever  parallelism  is  exhibited,  memory  traffic  from  a  vector  architecture 
is  less  than  it  is  from  a  superscalar  architecture.  Parallelism  demands  substantial  data 
bandwidth.  In  a  vector  architecture,  however,  memory  traffic  for  instructions  does  not 
increase  when  memory  traffic  for  data  does  because  vector  instructions  are  used.  Conversely, 
in  a  superscalar  architecture,  instruction  demand  increases  in  conjunction  with  data  demand 
because  each  operation  corresponds  to  an  instruction.  Even  if  an  instruction  cache  is  used 
to  reduce  traffic  to  main  memory,  a  cache  for  a  superscalar  architecture  must  deliver  many 
more  instructions  than  one  in  a  vector  architecture.  Moreover,  such  an  instruction  cache 
may  also  have  to  be  larger  to  provide  a  good  hit  ratio  when  techniques  such  as  loop  unrolling 
are  used  to  increase  the  amount  of  available  parallelism  at  the  expense  of  increased  code 
size. 

Second, vector memory instructions prefetch data from the memory system.
Because a stream of references through a memory port could exhibit a regular pattern, a
high-bandwidth  memory  system  can  be  designed  to  take  advantage  of  this  regularity.  A 
vector  memory  instruction  encodes  this  pattern  in  three  pieces  of  information:  the  base 
address,  the  offset  between  successive  addresses  (known  as  the  stride),  and  the  number  of 
words  to  access.  Hence,  the  memory  system  is  told  about  the  pattern  at  the  time  a  vector 
memory instruction is issued. In contrast, a superscalar architecture treats memory
references individually. Consequently, additional hardware would be needed to first discover the
pattern.  Alternatively,  prefetching  could  be  performed  by  the  software  by  including  extra 
instructions  [17,  72].  A  disadvantage  of  this,  however,  is  an  increase  in  memory  traffic  for 
instructions. 
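
The three pieces of information carried by a vector memory instruction are enough to regenerate every address in the stream. The sketch below illustrates this under the assumption that addresses are counted in words; the function names are hypothetical and do not correspond to any particular machine.

```python
def vector_addresses(base, stride, length):
    """Expand (base, stride, length) into the word addresses it denotes."""
    return [base + i * stride for i in range(length)]


def address_transfers(length, vector):
    """Number of values the processor sends to memory for `length` references.

    Assumption: a vector instruction sends base, stride, and length once,
    whereas a processor that treats references individually sends one
    address per reference.
    """
    return 3 if vector else length


# A stride-8 access to four words starting at address 0x1000.
print([hex(addr) for addr in vector_addresses(0x1000, 8, 4)])
```

Because the pattern is known when the instruction is issued, a memory system that can generate addresses itself has the whole stream available for prefetching without any pattern-detection hardware.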

Finally, the physical package that contains a vector VLSI processor requires significantly
fewer pins to communicate with the memory system than one that contains a
superscalar  processor.  This  is  an  important  consideration  when  implementing  a  single-chip 
VLSI  processor  with  a  limited  number  of  pins.  Because  a  superscalar  architecture  has  no 
mechanism  for  encapsulating  multiple  memory  references,  the  processor  computes  and  sends 
the  address  of  each  memory  operation  to  the  memory  system.  Consequently,  each  memory 
port  must  have  associated  with  it  both  a  data  and  an  address  bus.  In  addition,  because 
each  address  must  be  sent  out  from  the  processor  and  the  demand  for  data  will  be  one  or 
more  accesses  each  clock  period,  the  address  pins  will  be  used  just  as  frequently. 

In contrast, a vector memory instruction provides only three pieces of information
to specify multiple memory references. As long as the memory system can generate
addresses, this information can be sent once at the time an instruction is issued rather
than sending an address for each reference of a vector memory instruction. Although this
complicates the memory controller somewhat, the number of address buses is kept to only
[...] of a VLSI chip with hundreds of pins.


3.1.3  Register  File 

Hardware designers also mistakenly believe that the vector register file is costly [...]
use a monolithic one. These two configurations represent [...] in both the Hewlett Packard
Snake workstation and Sun SuperSPARC [...] registers with 5 read-ports and 3 write-ports.
As shown in Figures 3.1 and 3.2, dual-ported registers have a smaller [...] 64-bits wide,
the area of this vector register file (67.2 x 10[...] λ²) is only [...] suggests that any
architecture that supports parallelism will need to provide more registers.

The need for more registers is also demonstrated in several commercial implementations.
The IBM RS/6000, a superscalar design, actually implements 38 floating-point [...]
programmer and are used for renaming registers.
Figure 3.1: Design of a Multiported Register Cell

The multiported register cells listed in Figure 3.2 are a straightforward extension of the 1R,1W
register cell design shown above. In designing the cells, the VLSI technology used is scalable CMOS
where 0.4µ < λ < 2µ and the minimum line width is 2λ. I use only two metal layers with a minimum
pitch of 8λ (i.e., the minimum width of a metal line is 4λ and the minimum distance between lines
is 4λ). VLSI technology that is capable of more than two metal layers would not help in reducing
the size of the register cell because the extra layers of metal are much coarser: a minimum width
about three to five times wider than that of the first two layers.

The memory portion of each register cell is a pair of cross-coupled inverters consisting of four
transistors that force a minimum height of 41λ. To access the register cell, each port requires an
access transistor, a select line, and a data line. In addition, a write port requires a second access
transistor and data line. In the diagram above, the top two transistors are the access devices for
the write port and the bottom transistor is the access device for the read port.

The area of the register cell grows approximately as the square of the number of ports added
because each port forces the cell to grow in both height and width. Because the memory portion of
the cell can accommodate three select lines running width-wise across the cell, the height of the cell
does not grow until more than three ports are implemented, after which each port adds 8λ to the
height. A read port adds 14λ to the width: 8λ for the data line and 6λ for the access transistor. A
write port adds 28λ to the width because it requires two data lines and two access transistors.


NUMBER       DIMENSIONS       AREA                COMMERCIAL
OF PORTS     (w x h)          (RELATIVE)          MACHINES

1R,1W        50λ x 41λ        2050λ²  (1.00)
2R,1W        64λ x 41λ        2624λ²  (1.28)      MIPS R3000, MIPS R4000,
                                                  most RISC scalar microprocessors
4R,2W        120λ x 65λ       7800λ²  (3.80)      IBM RS/6000,
                                                  SUN SuperSPARC IU
5R,3W*       162λ x 81λ       13122λ² (6.40)      SUN SuperSPARC FPU

Figure 3.2: Area Requirements of Multiported Register Cells

* [...] that implements a register cell with 5 read ports and 3 write ports.
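
The growth rules stated in the caption of Figure 3.1 are sufficient to reproduce the dimensions and areas listed in Figure 3.2. The following sketch encodes those rules; the constant and function names are illustrative, and the model assumes a cell with at least one read port and one write port.

```python
BASE_WIDTH = 50         # width of the 1R,1W cell, in lambda
BASE_HEIGHT = 41        # height forced by the cross-coupled inverters, in lambda
READ_PORT_WIDTH = 14    # 8 for the data line + 6 for the access transistor
WRITE_PORT_WIDTH = 28   # two data lines and two access transistors
FREE_SELECT_LINES = 3   # select lines that fit without increasing the height
SELECT_LINE_PITCH = 8   # height added by each port beyond the third


def cell_dimensions(reads, writes):
    """Return (width, height) in lambda for a cell with the given ports."""
    width = (BASE_WIDTH
             + READ_PORT_WIDTH * (reads - 1)
             + WRITE_PORT_WIDTH * (writes - 1))
    extra_ports = max(0, reads + writes - FREE_SELECT_LINES)
    height = BASE_HEIGHT + SELECT_LINE_PITCH * extra_ports
    return width, height


def cell_area(reads, writes):
    """Return the cell area in lambda squared."""
    width, height = cell_dimensions(reads, writes)
    return width * height


for reads, writes in ((1, 1), (2, 1), (4, 2), (5, 3)):
    w, h = cell_dimensions(reads, writes)
    ratio = cell_area(reads, writes) / cell_area(1, 1)
    print(f"{reads}R,{writes}W: {w} x {h} = {w * h} lambda^2 ({ratio:.2f})")
```

The ratio in the last column is the area of a multiported cell relative to the dual-ported 1R,1W cell, which is the quantity used to compare the monolithic and partitioned organizations.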
Figure 3.3: Number of Registers versus Parallelism

Based on Wall’s parallelism data [115], this graph shows that the number of registers [...]
parallelism can be extracted. To show the extent to which the number of registers can [...]
Similar trends also occur for more realistic models of computation.
VLIW implementations, which are designed with more datapath parallelism than is found
in superscalar machines, have a large number of 64-bit registers:


IMPLEMENTATION           NUMBER OF 64-BIT REGISTERS

Intel iWarp              64
Multiflow Trace 7         7 x 32 = 224
Multiflow Trace 14       14 x 32 = 448
Multiflow Trace 28       28 x 32 = 896
Cydrome Cydra 5           6 x 64 = 384


Finally,  vector  architectures  use  a  large  number  of  registers  ranging  from  512  in  a  Cray 
processor  to  8192  in  Ardent  and  Fujitsu  processors  (see  Figure  2.5  on  page  15). 

Given that more than 32 registers are required to support a reasonable amount of
fine-grain parallelism, Figure 3.4 shows that the partitioned approach is more attractive than
the monolithic approach from the perspective of hardware cost because, for the same area,
many more registers can be implemented in a partitioned register file than in a monolithic
one. The overall size of the register file is determined mainly by the size of the register cell,
the most replicated part of the register file. Other components that are needed to access
the register file, such as decoders and read/write drivers for the data lines, are typically less
than 5% of the area required by the register cells themselves (assuming 64-bit registers).
Consequently, the relative size of the two register files is equal to the relative sizes of the
register cells. If $S_{R,W}$ is the size of a register cell with R read ports and W write ports,
the partitioned register file can implement $S_{R,W}/S_{1,1}$ times more registers in the same area.
Alternatively, the partitioned register file requires $S_{R,W}/S_{1,1}$ times less area to implement the
same number of registers. Figure 3.2 lists some values for the $S_{R,W}/S_{1,1}$ ratio.

In  fact,  this  difference  in  area  may  be  reversed  with  increasing  datapath  parallelism 
because  a  monolithic  register  file  needs  to  expand  even  if  the  number  of  registers  remains 
unchanged,  whereas  a  partitioned  register  file  does  not.  Increasing  datapath  parallelism 
requires  a  corresponding  increase  in  the  number  of  ports  in  the  register  file  because  these 
should  match  the  number  of  operands  and  results  used  and  produced  by  the  functional 
units. Of the two techniques for providing a multiported register file, the partitioned
approach, which uses dual-ported registers, scales better with increasing datapath parallelism
than does the monolithic approach, which uses multiported registers. To support more
parallelism, the monolithic approach uses a register cell with more ports. As Figure 3.2
shows,  adding  more  ports  to  a  register  cell  expands  the  area  in  both  dimensions.  Hence, 
the  area  of  a  monolithic  register  file  enlarges  as  the  square  of  the  increase  in  the  number 
of  ports  even  though  the  number  of  registers  does  not  change.  By  contrast,  adding  more 
ports  to  a  partitioned  register  file  entails  partitioning  the  file  further  but  without  having 
to  change  the  size  of  the  register  cell.  If  the  total  number  of  registers  remains  unchanged, 
only  a  minimal  increase  in  area  will  result  when  adding  more  “ports”  to  the  register  file. 

Although  a  partitioned  register  file  requires  less  area  than  a  monolithic  one,  the 
former also has restrictions on which registers are available for use each clock period.
However, as datapath parallelism is increased, the partitioned approach will be the better design
for  reasons  of  cost,  despite  the  loss  in  functionality. 
Figure 3.4: Area Requirements of Monolithic and Partitioned Register Files

The graph above compares the area requirements of the monolithic and partitioned approaches
for implementing a multiported register file. Note the log scale on both axes. The areas are based
on the design of the register cells described in Figure 3.2.

As points of reference, I have identified data points that correspond to machines with the same
register file parameters although not necessarily the same area because the actual implementation
may use a different technology. In particular, assuming the same technology, the register file of
the Cray Y-MP would require 2.5 times as much area as the register file of the Texas Instruments
Megacell, but it would provide 16 times as many registers.
A superscalar architecture could implement a partitioned register file, although
this is not traditionally done. To prevent any loss in performance as a result of too many
conflicts in register accesses, however, a new algorithm would be needed to assign values
to registers in register banks. For example, the Multiflow Trace 14/300 system, which is
a VLIW architecture, has a register file with read and write restrictions that effectively
partition the file into register banks. Experience with this system led its designers to
observe that “the more byzantine the constraints put on the code generator, the worse the
code quality,” which in turn results in performance loss [22]. Because of the similarities
between a VLIW architecture and a superscalar one, this observation would probably hold
true for a superscalar implementation with a comparable register file organization. Hence,
the need for a good assignment algorithm is probably the main reason why superscalar
architectures have not implemented a partitioned register file.

In  contrast,  a  partitioned  register  file  fits  well  with  the  vector  architecture  in 
that  no  special  algorithm  for  register  assignment  is  needed  to  overcome  the  restrictive 
accessibility  of  such  a  design.  Because  a  vector  instruction  operates  on  a  vector  of  data 
which  can  be  assigned  to  a  vector  register,  register  assignment  for  a  vector  register  file  can 
occur at the vector-register level rather than at the level of individual registers. Hence, as a
natural  consequence  of  the  vector  computational  model,  traditional  algorithms  for  assigning 
values  to  registers  in  a  scalar  register  file  can  be  used  to  assign  vectors  of  data  to  vector 
registers  in  a  vector  register  file.  Furthermore,  a  vector  architecture  can  easily  support  more 
datapath parallelism because a partitioned register file can support more register-ports with
a minimal increase in area.


3.1.4  Instruction-Issue  Logic 

Another advantage of a vector architecture is its simpler instruction-issue mechanism
for simultaneously initiating multiple operations. In general, before an instruction
can be issued, interlock logic in the hardware must first determine that no data or structural
hazards exist between that instruction and any previous one in the instruction stream.
In  addition  to  the  number  of  instructions  that  must  be  examined  simultaneously,  another 
indicator  of  the  amount  of  hardware  needed  to  simultaneously  initiate  multiple  operations 
is  the  number  of  hazard  checks  that  must  be  performed  in  parallel. 

In a vector architecture, because only one instruction per clock period is ever handled
by the interlock logic, the amount of hardware required to check for hazards is about
the same as in a scalar implementation. In addition, no hardware checks are performed
among the individual operations of a vector instruction because a compiler has guaranteed
that the appropriate operations have been grouped into one vector instruction; hence,
hardware does not duplicate the work of the compiler.

In contrast, the issue mechanism of a superscalar architecture requires more hardware
than that of a vector processor with equivalent parallelism support. First, the
logic must simultaneously examine a minimum of N instructions as a requirement for full
usage of N functional units. Moreover, because there must be a hazard check between each
instruction and any previous one in the instruction stream, not only are there hazard checks
between each examined instruction and already-issued instructions, as shown in Figure 3.5,
but there are also explicit pair-wise checks among the about-to-issue instructions. Each of
these checks must be implemented in hardware so that they can execute in parallel. In
addition, extra hardware is needed to handle any instructions with hazards, either to stall the
instruction stream at the first instruction with a hazard or to design the pipeline to allow
forwarding of data [13]. A consequence of this increase in hardware for issuing instructions
is that the design and diagnostics required for functional testing are also more complex and,
hence, more likely to take longer to complete.

Figure 3.5: Instruction-Issue Logic in a Superscalar Architecture

This figure shows how the hardware cost for issuing instructions in a superscalar architecture
grows as the square of the number of instructions that are simultaneously examined for issuing. The
minimum number of instructions that must be examined per clock period is equal to the number of
functional units. An instruction can be issued only if there are no data or structural hazards between
itself and any previous instruction in the instruction stream. Moreover, these checks for hazards
must execute in parallel. Consequently, for each instruction in the issue window, there is interlock
logic for detecting hazards with instructions already issued (indicated by the vertical arrows). In
addition, there is interlock logic for detecting hazards between each pair of instructions in the issue
window (indicated by the right-to-left arrows).

This difference in hardware for issuing instructions is greatly magnified as hardware
designers increase the parallelism in a processor. In a vector architecture, because
sequentially-issued vector instructions allow operations to be initiated in parallel, the number
of instructions handled by the interlock logic remains at one even as datapath parallelism
grows. In contrast, as Figure 3.5 shows, the total number of pair-wise hazard checks
required in a superscalar processor increases as the square of the increase in datapath
parallelism. This quadratic growth in hardware was described as early as 1970 by Tjaden and
Flynn [111]. Recently, Johnson described techniques for reducing this hardware cost, but
such techniques have the adverse effect of complicating the hardware design [66]. Extra
logic for handling hazards and the increase in debugging complexity also magnify at the
same rate as that of the interlock hardware.
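
The quadratic growth can be illustrated by counting the interlock checks described in Figure 3.5. The sketch below is a counting model only, under the assumption that each instruction in the issue window needs one block of logic to check against already-issued instructions and one block for every pair within the window; the function names are hypothetical.

```python
def superscalar_checks(n):
    """Interlock checks for an issue window of n instructions.

    Returns (checks against already-issued instructions,
             pair-wise checks among the n about-to-issue instructions).
    """
    against_issued = n
    pairwise = n * (n - 1) // 2
    return against_issued, pairwise


def vector_checks(n):
    """A vector processor issues one instruction per clock period,
    regardless of the datapath parallelism n."""
    return 1, 0


for n in (1, 2, 4, 8, 16):
    issued, pairwise = superscalar_checks(n)
    print(f"N={n:2d}: superscalar {issued} + {pairwise} pair-wise, "
          f"vector {vector_checks(n)[0]} + {vector_checks(n)[1]}")
```

Doubling the window from four to eight instructions raises the pair-wise count from 6 to 28, while the vector count stays at one.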
3.2  The  Effectiveness  of  Vector  Architectures

Another common misconception about vector architectures is that they are effective
for only a small set of programs, and then only for those portions of a program
containing loops that have no self-dependent statements. Moreover, proponents of superscalar
architectures believe that, for less cost, a superscalar architecture can take advantage
of vectorizable parallelism as well as non-vectorizable parallelism. I believe the opposite to
be true. In this section, I address the issue of effectiveness, arguing that a vector architecture
is, in fact, highly effective at using fine-grain parallelism.

There are three parts to my argument. In Section 3.2.1, I present data from a parallelism
study that suggests that vectorizable programs have an abundance of parallelism
and that only a minuscule amount of parallelism is available elsewhere for any architecture.
This data also suggests that vectorizable programs are likely to be the more time-consuming
ones in a workload. Traditional analyses of this data, which are based on reducing execution
time, tend to downplay time consumption. In Section 3.2.2, I discuss an alternate
measure of improvement, based on increased workload, that highlights the effectiveness of
using parallelism. Nonetheless, Amdahl’s Law reminds us that ignoring non-vectorizable
program fragments completely would be unwise. In Section 3.2.3, I show how a combined
superpipelined and vector architecture can take advantage of both the limited parallelism
that is available in non-vectorizable program fragments and the abundance of parallelism
that is available in vectorizable ones.


3.2.1  Where  Is  the  Parallelism? 

An understanding of the properties of a vectorizable program fragment shows
intuitively that a vectorizable loop intrinsically has more parallelism than a non-vectorizable
one. Although many hardware and software techniques are used to expose the parallelism
in a loop, it is the presence or absence of self-dependent statements that determines how
much parallelism is present in a loop. For example, while unrolling is a software
technique for exposing more parallelism, unrolling a vectorizable loop, which has no
self-dependent statements, produces more parallelism than unrolling a non-vectorizable one
that has a comparable number of operations. This is because the dependence graph of an
unrolled vectorizable loop has a path of maximal length, known as a critical path, much
shorter than a critical path in the dependence graph of an unrolled non-vectorizable loop.
As Figure 3.6 illustrates, a self-dependent statement results in a critical path whose length
is proportional to the number of iterations executed. In contrast, the absence of a
self-dependent statement produces a critical path whose length is proportional to the number
of operations executed in an iteration. An indication of the amount of parallelism available
in a loop is the ratio of the number of operations executed for the loop and the number
of operations in a critical path of the loop. Because the number of iterations executed for
a loop is typically greater than the number of operations executed for one iteration, the
critical path of a vectorizable loop will be shorter than that of a non-vectorizable loop that
has a comparable number of operations. Hence, in theory, a vectorizable loop has more
parallelism than a non-vectorizable one.
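
The ratio described above can be illustrated with a minimal sketch. It builds the dependence graph of an unrolled four-iteration loop, finds the critical path, and divides the operation count by the path length. The breakdown of each iteration into two loads, an add, and a store, and all function names, are assumptions made for illustration; they are not taken from the figure.

```python
from functools import lru_cache


def unrolled_loop_graph(iterations, self_dependent):
    """Return {operation: [predecessors]} for an unrolled loop body.

    Assumed per-iteration operations: load A, load B, add, store.
    When self_dependent is True (A(I+1) = A(I) + B(I)), the load of A in
    iteration i must wait for the store of iteration i - 1.
    """
    preds = {}
    for i in range(iterations):
        carried = [("store", i - 1)] if self_dependent and i > 0 else []
        preds[("load_a", i)] = carried
        preds[("load_b", i)] = []
        preds[("add", i)] = [("load_a", i), ("load_b", i)]
        preds[("store", i)] = [("add", i)]
    return preds


def critical_path_length(preds):
    """Number of operations on a longest path through the dependence graph."""

    @lru_cache(maxsize=None)
    def depth(op):
        return 1 + max((depth(p) for p in preds[op]), default=0)

    return max(depth(op) for op in preds)


def parallelism(preds):
    """Operations executed divided by operations on a critical path."""
    return len(preds) / critical_path_length(preds)


if __name__ == "__main__":
    for label, dependent in (("non-vectorizable", True), ("vectorizable", False)):
        graph = unrolled_loop_graph(4, dependent)
        print(label, critical_path_length(graph), round(parallelism(graph), 2))
```

Under these assumptions the critical path of the self-dependent loop grows with the number of iterations, whereas that of the vectorizable loop stays at the length of one iteration, so the ratio for the vectorizable loop grows as more iterations are unrolled.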

In support of this intuitive explanation, I use data from a study performed by Wall [115]
that indicates that programs with vectorizable loops do have plenty of parallelism while
non-vectorizable ones do not. Because Wall did not use a vectorizing compiler to generate
his data, this study provides independent evidence that parallelism, when it exists in
quantity, is suitable for such an architecture.


NON-VECTORIZABLE LOOP                    VECTORIZABLE LOOP

      DO 10 I = 1,4                            DO 10 I = 1,4
   10 A(I+1) = A(I) + B(I)                  10 A(I) = A(I) + B(I)

[Dependence graphs of the two loops after LOOP UNROLLING; not recoverable from the source text]

Figure 3.6: Intrinsic Parallelism in Non-vectorizable and Vectorizable Loops

[...] the resultant dependence graph will remain connected. In contrast, the dependence
graph of a vectorizable loop [...] disjoint subgraphs, each of which [...]. The critical
path of this loop, outlined in bold, is limited to the operations of one iteration and is,
hence, shorter than the critical path in the non-vectorizable loop.

Wall measured, under a variety of hardware and software conditions, the parallelism
available in 17 programs representative of those that would be executed on a workstation.
Wall identifies and varies three parameters that affect how much parallelism can
be extracted from a program:

1.  the  level  of  branch/jump  prediction  to  find  parallelism  across  multiple  basic  blocks; 

2.  the  number  of  registers  for  renaming  purposes  to  eliminate  false  register  dependences; 
and 

3.  the  level  of  dependence  analysis  (called  alias  analysis  by  Wall)  to  identify  when  two
memory  references  access  the  same  location. 

Figure 3.7 lists some of the parameter values used by Wall. Another parameter, multiple
functional units, is fixed at 64 for this study. Varying the value of these parameters results
in different models of computation. The model of computation that is closest to what a
vector architecture can provide today has the following parameter values:

•  static branch/jump prediction, which chooses the most frequent target based on a
profile from an identical run;

•  256 integer registers, 256 floating-point registers; and

•  perfect dependence analysis of stack and global references, and instruction inspection
to identify memory dependences among heap references (called alias analysis by
compiler by Wall).

Although Wall did not include data for this particular computational model, a reasonable
approximation is the one that uses perfect dependence analysis because the parallelism
demonstrated for most of the computational models using compiler analysis is comparable
to what is exhibited for the corresponding computational model using perfect analysis.

In Figure 3.8, I have reproduced the parallelism data for five models of computation:

1.  bNone,jNone,r256,aNone shows how much parallelism is extractable using basic scalar
execution and a generous supply of registers;

2.  bNone,jNone,r256,aPerfect indicates the parallelism available within basic blocks;

3.  bStatic,jStatic,r256,aPerfect approximates how much parallelism can be obtained by
a vector compiler and processor;


LEVEL OF BRANCH/JUMP PREDICTION

  bNone,jNone           branch/jump targets are not predicted at all

  bStatic,jStatic       a branch/jump is predicted to go to its most frequent target as
                        determined by a profile from an identical run

  bInfinite,jInfinite   the target of a branch/jump is dynamically predicted based on a
                        two-bit counting scheme in which the table holding the branch
                        histories is infinitely large

  bPerfect,jPerfect     branch/jump targets are always correctly predicted

NUMBER OF REGISTERS

  r256                  256 integer registers and 256 floating-point registers dynamically
                        allocated in an LRU fashion

  rPerfect              an infinite number of registers to completely eliminate all false
                        register dependences

LEVEL OF DEPENDENCE ANALYSIS

  aNone                 no memory dependences are identified, and all loads and stores are
                        assumed to conflict

  aPerfect              all memory dependences are identified, and loads and stores conflict
                        only if they access the same memory location

Figure 3.7: Parameter Values for Models of Computation

This table gives the details of the parameter values used for the computational models displayed
in Figure 3.8. Some amount of branch/jump prediction, in effect, increases the basic block sizes and
hence the number of instructions that can be considered for parallel execution. A more accurate
prediction scheme results in fewer wasted cycles that occur when the processor flushes any pending
instructions after incorrectly predicting a branch/jump. Register renaming is used to reduce the
number of false register dependences that arise because the executable code is compiled for an
architecture with 32 registers. The number of registers provided by the hardware determines the number
of false dependences that can be eliminated. Because dependence analysis (called alias analysis by
Wall) identifies instructions that access the same memory location, the level of dependence analysis
affects how many instructions can execute in parallel. For identifying memory dependences, I use
the two extremes of dependence analysis.


Figure  3.8:  Measured  Parallelism  under  Various  Hardware  and  Software  Conditions 

This graph reproduces some of Wall’s data [115] showing the amount of parallelism that can
be extracted under five different models of computation. These models vary in the number of
available registers, and in their ability to do branch/jump prediction and identify dependences
that involve memory. Figure 3.7 gives details on the values of individual parameters. The inputs
for the models are the executables of programs compiled for a DECstation 5000 that uses a
MIPS R3000 processor, a scalar architecture with 32 registers.


4.  bInfinite,jInfinite,rPerfect,aPerfect indicates how much parallelism is accessible under
nearly impossible conditions; and

5.  bPerfect,jPerfect,rPerfect,aPerfect gives the intrinsic parallelism in a program.


For the vector computational model, only the 3 programs tomcatv, fpppp, and Linpack
have parallelism above 10, whereas the other 14 programs have parallelism between 4 and
8. Of the three programs, tomcatv and Linpack contain loops vectorizable by cf77, the
vectorizing compiler developed by Cray Research, Incorporated. More importantly, these
loops constitute the bulk of the time it takes to execute each program.

The program fpppp is unique in that it contains large basic blocks [30] that have
parallelism in quantity. This is unusual because large basic blocks do not guarantee copious
amounts of parallelism, as evidenced by the generally low levels of parallelism exhibited
under a nearly impossible computational model (labeled bInfinite,jInfinite,rPerfect,aPerfect
in Figure 3.8), which assumes aggressive branch/jump prediction techniques to enlarge
basic blocks. The large basic blocks of fpppp, however, do contain a large amount
of parallelism, as demonstrated by the fact that only fpppp has parallelism greater than 10
when there is no branch/jump prediction, a generous supply of registers, and perfect dependence
analysis, the model of computation (labeled bNone,jNone,r256,aPerfect in Figure 3.8) that
measures the parallelism available exclusively in a basic block. Although current vectorizing
compilers are unable to identify and express this parallelism in vector terms, perhaps with
more research fpppp will be vectorizable in the near future.

Of the 14 programs that have parallelism between 4 and 8, only the Livermore
Loops benchmark is known to contain any vectorizable loops. Drawn from scientific
applications, this benchmark is a set of representative loops, about half of which are
loops with self-dependent statements. Hence, this data bolsters my claim that only
meager parallelism is to be found in non-vectorizable loops. Wall states that the parallelism
for an individual loop in this benchmark ranges from 2.4 to 29.9 with a median around 5.
The reason a parallelism of only 4.9 is demonstrated over all the loops is a consequence of
Amdahl’s Law, an issue I will address shortly.

Providing increasingly more resources for branch/jump prediction and registers
does not significantly change this parallelism profile. Even under nearly impossible
conditions (Wall’s Great model, which is labeled bInfinite,jInfinite,rPerfect,aPerfect in
Figure 3.8), only tomcatv and fpppp show a significant increase in parallelism over the vector
model. A dramatic change in the parallelism profile occurs when perfect branch and jump
prediction is assumed (Wall’s Perfect model, which is labeled bPerfect,jPerfect,rPerfect,aPerfect
in Figure 3.8). Under such impossible conditions, all but Whetstones and Livermore Loops
demonstrate parallelism greater than 15. Hence, although there is parallelism intrinsic in
non-vectorizable program fragments, it is not as easily extracted as parallelism in
vectorizable loops, which require only a good dependence analyzer in the compiler, a technology
that is already available.


3.2.2  The  Effectiveness  of  Parallelism 

In the previous subsection, I demonstrated that there is abundant parallelism in
vectorizable programs. In this subsection, I will quantify how effectively this parallelism
can be used to improve the performance of a workload as typified by Wall’s workload.

The standard reason for using parallelism is to reduce execution time, but because
of Amdahl’s Law, measuring improvement in this fashion tends to understate the
benefits of parallelism. For example, using the instruction counts listed in Figure 3.9 as
approximations to execution times, the overall speedup is:


\[
\frac{\text{execution time of total workload}}
     {\sum_i \dfrac{\text{execution time of program } i}{\text{parallelism of program } i}}
= 9.81
\]


In other words, the total execution time of Wall’s workload can be reduced by a factor
of less than 10 despite the fact that the program that accounts for 44.5% of the executed
instructions exhibits the largest amount of parallelism (44). This rather low speedup results
because li, the second longest running program, exhibits little parallelism (5.2).

As an alternative, a different reason for using parallelism is to execute larger problems
in the same amount of time, thus providing a different measure of improvement based
on increasing workload. Gustafson has shown quantitatively the importance of increasing
the size of a workload to provide more parallelism [54]. To facilitate acceptance of this
relatively new measure of performance, Gustafson et al. have constructed the SLALOM
benchmark, which compares the performance of computers based on the number of
polygons a computer can generate in one minute [53]. To use this measure for Wall’s data,
I assume that the time contributed by each program remains the same (an assumption I
will discuss shortly). Thus, the overall increase in Wall’s workload using parallelism is:

\[
\frac{\sum_i \left(\text{execution time of program } i \times \text{parallelism of program } i\right)}
     {\text{execution time of entire workload}}
= 24.11
\]

In other words, about 25 times as many instructions can be executed without increasing
the workload’s execution time.

These two ways of measuring improvement rely upon slightly different concepts
of what a workload is. A workload, when improvement is measured by reduced execution
time, is characterized by a set of programs and their [...]. A workload,
when improvement is measured by its enlargement, is characterized by a set of programs
and their time contributions to the workload.

The dramatic improvement when measured by increased workload is due to the
fact that the most time-consuming program in Wall’s workload also happens to exhibit
the most parallelism, an important characteristic of any workload if the benefits of
parallelism are to be obtained. Let me demonstrate the importance of this characteristic
with a fictitious workload as a counter-example:

By switching the parallelism numbers for tomcatv and sed, the least time-consuming
program now exhibits the most parallelism. The improvement of
this fictitious workload when measured by reduced execution time is 6.6, 30%
less than the corresponding improvement of Wall’s workload. More significantly,
the improvement when measured by increased workload is only 8, a factor of
three less than that for Wall’s workload.
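
The two measures of improvement, and the effect of the fictitious swap, can be checked against the data in Figure 3.9 with a short sketch. Instruction counts stand in for execution times, as in the text. The function names are invented for illustration, and the program name "grr" is an assumption because that entry is garbled in the source table.

```python
# (executed instructions, parallelism) per program, transcribed from Figure 3.9.
# "grr" is an assumed name; the entry is illegible in the source.
WORKLOAD = {
    "tomcatv": (1_986_257_545, 43.7),
    "li": (1_247_190_509, 5.2),
    "fpppp": (244_124_171, 29.7),
    "doduc": (284_697_827, 8.0),
    "Linpack": (174_883_597, 13.6),
    "espresso": (135_317_102, 5.0),
    "grr": (142_980_475, 4.0),
    "metronome": (70_235_508, 5.7),
    "yacc": (30_948_883, 5.3),
    "eco": (26_702_439, 4.7),
    "Whetstones": (24_479_634, 4.4),
    "gcc1": (22_745_232, 4.8),
    "Livermore": (22_294_030, 4.9),
    "Stanford": (20_759_516, 4.0),
    "ccom": (18_465_797, 5.5),
    "egrep": (13_910_586, 4.3),
    "sed": (1_447_717, 7.4),
}


def improvement_by_reduced_time(workload):
    """Weighted harmonic mean: total time / sum(time_i / parallelism_i)."""
    total = sum(count for count, _ in workload.values())
    return total / sum(count / par for count, par in workload.values())


def improvement_by_increased_workload(workload):
    """Weighted arithmetic mean: sum(time_i * parallelism_i) / total time."""
    total = sum(count for count, _ in workload.values())
    return sum(count * par for count, par in workload.values()) / total


def swap_parallelism(workload, first, second):
    """Return a copy of the workload with two programs' parallelism exchanged."""
    swapped = dict(workload)
    swapped[first] = (workload[first][0], workload[second][1])
    swapped[second] = (workload[second][0], workload[first][1])
    return swapped


if __name__ == "__main__":
    fictitious = swap_parallelism(WORKLOAD, "tomcatv", "sed")
    for name, data in (("Wall", WORKLOAD), ("fictitious", fictitious)):
        print(name,
              round(improvement_by_reduced_time(data), 2),
              round(improvement_by_increased_workload(data), 2))
```

With these inputs the two measures should come out close to the 9.81 and 24.11 reported for Wall’s workload, and close to 6.6 and 8 for the fictitious one.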

Hence, using the size of a workload as a measure of improvement serves to highlight the
effectiveness of parallelism, particularly if a workload contains highly parallel programs that
are also the most time-consuming.

When measuring improvement by increased workload, I assumed that the time
contributed by each program remains the same. This is equivalent to assuming that programs
with more parallelism are the ones that a user wishes to execute the most, a reasonable
assumption given the context in which I am making my case. In Wall’s workload, the
programs with the most parallelism come from the scientific and engineering domain. The
push for higher performance computers, such as workstations and MPPs, comes from this
application domain. Two major benchmarking efforts reflect this: all 13 programs of the
Perfect Club Suite [12] and six out of the 10 programs in the SPEC suite [106] come from
scientific and engineering applications.

Given the fact that Wall’s workload represents about three minutes of execution
time (assuming a 25 MHz clock frequency, which is used in a SUN SPARCstation, and ideal
cache behaviour), increasing workload seems a better use of parallelism than simply
reducing execution time. Continually measuring improvement in terms of reduced execution
time will ultimately limit the amount of improvement that is theoretically possible. In
contrast, measuring improvement in terms of increased workload has no limitation, assuming
that the sizes of problems can grow indefinitely.

3 SLALOM is an acronym for Scalable Language-independent Ames Laboratory One-minute Measurement.


PROGRAM            LINES     EXECUTED INSTRUCTIONS          PARALLELISM
                             number          (percentage)

tomcatv              180     1,986,257,545   (44.5 )        43.7
li                  7000     1,247,190,509   (27.9 )         5.2
fpppp               2600       244,124,171   ( 5.5 )        29.7
doduc               5200       284,697,827   ( 6.4 )         8.0
Linpack              814       174,883,597   ( 3.9 )        13.6
espresso           12000       135,317,102   ( 3.0 )         5.0
grr                 5883       142,980,475   ( 3.1 )         4.0
metronome           4287        70,235,508   ( 1.6 )         5.7
yacc                1856        30,948,883   ( 0.7 )         5.3
eco                 2721        26,702,439   ( 0.6 )         4.7
Whetstones           462        24,479,634   ( 0.5 )         4.4
gcc1               83000        22,745,232   ( 0.5 )         4.8
Livermore            268        22,294,030   ( 0.5 )         4.9
Stanford            1019        20,759,516   ( 0.5 )         4.0
ccom               10142        18,465,797   ( 0.4 )         5.5
egrep                844        13,910,586   ( 0.3 )         4.3
sed                 1751         1,447,717   ( 0.03)         7.4

TOTAL WORKLOAD    140027     4,467,440,568   (99.93)         9.81  harmonic mean
                                                            24.11  arithmetic mean

Figure 3.9: Execution Characteristics of Wall’s 17 Programs

This table lists some characteristics of the programs in Wall’s parallelism study. The number
of executed instructions ranges from 1.4 million to 2.0 billion with an average of
262.8 ± 533.0 million.

Parallelism for the total workload can be represented by two weighted averages, depending
upon how performance improvement for the total workload is measured. The weight is a program’s
percentage of the workload with respect to time, which is estimated by the number of executed
instructions. Improvement as measured by reduced execution time is equivalent to the weighted
harmonic mean of the amount of parallelism, whereas improvement as measured by increased
workload is equivalent to the weighted arithmetic mean of the amount of parallelism. The
unweighted mean excludes the time contributed by each program and will not reflect the fact
that, in this workload, the most time-consuming program benefits the most from parallelism.

3.2.3  Addressing  Amdahl’s  Law 

I have just argued that parallelism, when available in quantity, can be used
effectively by a vector architecture. Nevertheless, some amount of parallelism does exist in
non-vectorizable parts. To avoid the consequences of Amdahl’s Law, a vector architecture
combined with superpipelined hardware can be used to take advantage of the limited
parallelism (4-7) in such programs. A superpipelined extension to a vector architecture makes
more sense than a superscalar extension because the instruction-issue logic for vector and
superpipelined implementations is similar; in particular, both issue only one instruction
per clock period. Moreover, Jouppi indicates that superpipelined hardware is more likely
to take advantage of parallelism better than superscalar hardware because of nonuniformities
in fine-grain parallelism [69]. In addition to using the limited parallelism in non-vectorizable
program fragments, which a basic vector architecture cannot take advantage of, superpipelined
hardware can also be used in conjunction with vector hardware to execute vectorizable loops.
In this subsection, I present data showing how effectively this combination works for both
non-vectorizable and vectorizable programs.

Combining vector and superpipelined hardware can provide good scalar performance
in a vector processor, as evidenced by the Cray machines. Figure 3.10 compares
the performance of the scalar MIPS R2000 in the DECstation 3100 and the
scalar portion of the Cray Y-MP executing spice, a circuit-simulation program that does
not vectorize. As a point of reference, the clock frequencies of these machines differ
by a factor of 10: the Cray Y-MP has a cycle time of 6 ns whereas the MIPS R2000 has a
cycle time of 60 ns [24, 92]. Because I want to emphasize the superpipelined aspect of the
processors and not the implementation technology, this discussion is based solely on counts
of clock periods. As Figure 3.11 shows, the Cray Y-MP has longer latencies in terms
of clock periods than does a basic scalar processor, such as the MIPS R2000, which range
from 3 times longer for floating-point operations to 8 times longer for a memory operation.
On the other hand, the MIPS R2000 has a much shorter memory latency because it has
a data cache. Despite longer latencies, the CPI (cycles per instruction) of the Cray Y-MP
(4.13) is only slightly more than two times that of the MIPS R2000 (1.95).

This surprisingly low CPI indicates that some amount of parallelism is being used
by the superpipelined hardware. This parallelism can be quantified by comparing the
measured CPI with the calculated CPI, another ratio of cycles-per-instruction. The latter ratio
is the weighted average of a processor’s operational latencies where the weights are based
on a program’s operational mix. Although based on dynamic information, this metric does
not take into account the interaction of the executed operations and indicates what the
CPI would be without pipelined execution. Because the measured CPI does reflect parallel
execution, the ratio of the calculated and measured CPIs (6.77/4.13 = 1.64) is the amount
of parallelism extracted by the superpipelined hardware of the Cray Y-MP.
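
The calculation just described can be sketched from the operation counts in Figure 3.10 and the latencies in Figure 3.11. The figures do not split floating-point operations into adds and multiplies, so the single floating-point latency used below (the mean of the add and multiply latencies) is an assumption. The eight-cycle branch and the two-cycle latency for other operations on the Cray Y-MP follow the assumptions stated in the caption of Figure 3.10. All names are illustrative.

```python
# Counts are in millions; latencies are in clock periods.
CRAY_YMP = {
    "cycles": 2744.9,
    "instructions": 664.8,
    "mix": {"memory": 145.0, "floating_point": 127.5, "branch": 49.1, "other": 343.1},
    # floating_point = (7 + 8) / 2 is an assumed equal mix of adds and multiplies;
    # all branches taken (8 CPs) and 2 CPs for other operations, per Figure 3.10.
    "latency": {"memory": 17, "floating_point": 7.5, "branch": 8, "other": 2},
}

MIPS_R2000 = {
    "cycles": 1711.7,
    "instructions": 875.8,
    "mix": {"memory": 325.7, "floating_point": 112.5, "branch": 60.0, "other": 377.7},
    # floating_point = (2 + 3) / 2 is the same assumed equal mix; memory hits in cache.
    "latency": {"memory": 2, "floating_point": 2.5, "branch": 2, "other": 1},
}


def calculated_cpi(machine):
    """Latency averaged over the operation mix, ignoring any overlap."""
    mix, latency = machine["mix"], machine["latency"]
    return sum(mix[op] * latency[op] for op in mix) / sum(mix.values())


def measured_cpi(machine):
    """Cycles actually spent per executed instruction."""
    return machine["cycles"] / machine["instructions"]


def extracted_parallelism(machine):
    """Ratio of calculated to measured CPI."""
    return calculated_cpi(machine) / measured_cpi(machine)


if __name__ == "__main__":
    for name, machine in (("Cray Y-MP", CRAY_YMP), ("MIPS R2000", MIPS_R2000)):
        print(name,
              round(calculated_cpi(machine), 2),
              round(measured_cpi(machine), 2))
    print("Cray Y-MP parallelism", round(extracted_parallelism(CRAY_YMP), 2))
```

Under these assumptions the Cray Y-MP values should land close to the 6.77, 4.13, and 1.64 quoted above; a different split between adds and multiplies would shift the calculated CPI slightly.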

                                 MIPS R2000          Cray Y-MP

spice
#cycles                          1711.7M             2744.9M
#instructions                     875.8M              664.8M

OPERATIONAL MIX
#memory operations                325.7M (37%)        145.0M (22%)
#floating-point operations        112.5M (13%)        127.5M (19%)
#branches                          60.0M ( 7%)         49.1M ( 7%)
#other operations                 377.7M (43%)        343.1M (52%)

RATIOS
calculated CPI                    1.64                6.77
measured CPI                      1.95                4.13
average parallelism               0.87                1.64

Figure 3.10: Relative Performance of Superpipelined and Scalar Architectures

This table compares the performance of a superpipelined architecture, the Cray Y-MP, with
that of a scalar architecture, the MIPS R2000 in the DECstation 3100, executing a circuit-simulation
program, spice, that is simulating the circuit behavior of a digital-shift register. The operation
latencies of these processors [...]. The dynamic information for the Cray Y-MP was gathered
using the hardware performance monitor [...]. For the Cray Y-MP, I assume all branches are
taken and a two-cycle latency for other operations. For the MIPS R2000, all memory operations
are assumed to hit in the cache.

For the Cray Y-MP, the ratio of the calculated CPI and the measured CPI shows how much
parallelism is extracted by the hardware because the measured CPI is [...]
and the calculated CPI encompasses all the latencies seen by the processor. For the MIPS R2000,
the difference between the calculated and measured CPIs indicates how many cycles per instruction,
on average, are due to cache misses.


                              MIPS R2000     Cray Y-MP

memory operation              2 CPs          17 CPs
floating-point add            2 CPs          7 CPs
floating-point multiply       3 CPs          8 CPs
branch                        2 CPs          2-8 CPs
other operations              1 CP           1-7 CPs

Figure 3.11: Operation Latencies of Superpipelined and Scalar Architectures

This table shows the operation latencies of a superpipelined architecture, the Cray Y-MP, and
a scalar architecture, the MIPS R2000.

The memory operation latency in the MIPS R2000 is based on a cache hit. Branch operations
are any that can potentially change the sequential instruction stream, which includes
branches, jumps, and calls. The branch latency in the Cray Y-MP depends on whether the branch is
taken or not taken. Changing the instruction stream results in an eight-cycle branch, whereas falling
through takes two cycles. Other operations in the Cray Y-MP include population-count, logical, and
shift functions.

Most operations in the Cray Y-MP and the MIPS R2000 are delayed operations in that
independent ones may execute in the delay slot(s) of an operation that requires more than one clock period
to execute. The Cray Y-MP has no data cache and relies on the compiler to find enough operations
to fill the delay slots of a memory operation. The only operations that are not delayed are branches
in the Cray Y-MP and memory operations resulting in a cache miss in the MIPS R2000.


Because superpipelining already improves performance through parallelism, how
much more improvement can vector hardware provide? Weiss and Smith compared the
performance of various scalar compilation techniques [119], allowing me to compare the
performance of optimized scalar code with that of vectorized code. Although only scalar
performance is discussed in their paper, a brief comparison with vector performance is made
in the conclusion of the paper. The basis for the comparison is the harmonic mean for the
first 14 Livermore kernels executing on a Cray-1S. Using vector code produces a harmonic
mean of 10.51 MFLOPS, whereas the best scalar-compilation technique, which unrolls a loop
8 times and uses 64 scalar registers, slightly outperforms the vector version with a harmonic
mean of 11.15 MFLOPS. Although this conclusion does not appear highly supportive of
vector architectures, I use a more detailed breakdown of these performance summaries,
presented in Figure 3.12, to show that a vector processor, when it can be used, is about
three times faster than a superpipelined scalar processor.


NUMBER OF KERNELS                        3      +      4      +      7      =    14

MFLOPS CONTRIBUTION                 sum over S  +  sum over S?  +  sum over V  =  14/H
                                     of 1/Ri         of 1/Ri         of 1/Ri

COMPILATION   scalar, unoptimized      0.600    +    0.599    +    1.009    =  2.208
TECHNIQUE     vector only              0.600    +    0.599    +    0.133    =  1.332
              scalar only, unroll 8x   0.301    +    0.528    +    0.427    =  1.256
              vector + unroll 8x       0.301    +    0.528    +    0.133    =  0.962

all values in seconds per million floating-point operations
Ri is the MFLOPS rate of the i-th kernel
H is the harmonic mean of the MFLOPS rates of the 14 kernels

Figure 3.12: Relative Performance of Superpipelined and Vector Architectures

To compare the performance of a superpipelined architecture and a vector architecture with
superpipelining, this table shows the contributions in MFLOPS of different classes of kernels to
the harmonic means for different compilation techniques on the Cray-1. Instead of using harmonic
means, which are measured in MFLOPS, I use sums of the inverses of the rates, measured in seconds
per million floating-point operations, to highlight the contributions. This analysis is based on data
collected by Weiss and Smith on the first 14 Livermore Kernels [119].

The set of 14 kernels can be divided into three different classes based on their vectorizability.
The three kernels in the set S (called case 2 by Weiss and Smith) are strictly scalar. The four
kernels in the set S? (case 3) could possibly be vectorized, but not enough information is available
at compilation time to accurately determine this. The seven kernels in the set V (case 1) are
vectorizable.

The values for the scalar compilation techniques and the harmonic mean for the vector compilation
are taken directly from Weiss and Smith. The other numbers are derived as follows. For vector
compilation, the value for the vector kernels is computed as the difference between the inverse of
the vector harmonic mean and the sum of the inverses of the rates of the non-vectorizable kernels
using unoptimized scalar compilation. For vector + unroll compilation, superpipelined hardware
plus scalar unrolling techniques are used for the non-vectorizable kernels, and vector hardware plus
vectorization are used for the vector kernels.


Scalar performance is comparable overall to that of vector performance because
this is another instance of Amdahl’s law: a smaller average speedup over a larger portion of
the workload can result in better performance over the entire workload than a much larger
average speedup over a smaller portion of the workload. Unrolling scalar code produces an
average speedup of 2 for 10 kernels with only a marginal average speedup of 1.1 for the
remaining 4 kernels. In contrast, the vectorized code showed an average speedup of 7.5
for 7 kernels, assuming no improvement over the other half of the kernels. Although this
could be interpreted as evidence against the overall effectiveness of vector architectures, the
argument could also be used against superpipelined architectures; that is, seven kernels,
half the workload, could be improved by a factor of three if vector hardware were to be
added.
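
The arithmetic behind Figure 3.12 and the speedups quoted above can be retraced with a small sketch. Each row holds the per-class sums of inverse MFLOPS rates; dividing 14 by a row total recovers the harmonic mean. The value 0.427 for the vectorizable kernels under scalar unrolling is not legible in the source table and is derived here from the row total (1.256 - 0.301 - 0.528). All names are illustrative.

```python
KERNELS = 14

# Seconds per million floating-point operations for the classes (S, S?, V).
CONTRIBUTIONS = {
    "scalar, unoptimized": (0.600, 0.599, 1.009),
    "vector only": (0.600, 0.599, 0.133),
    # 0.427 is derived from the row total 1.256, not read from the table.
    "scalar only, unroll 8x": (0.301, 0.528, 0.427),
    "vector + unroll 8x": (0.301, 0.528, 0.133),
}


def harmonic_mean_mflops(contributions, kernels=KERNELS):
    """Harmonic mean of the kernel rates: kernels / sum of inverse rates."""
    return kernels / sum(contributions)


def class_speedups(baseline, improved):
    """Per-class speedup of one compilation technique over another."""
    return tuple(b / i for b, i in zip(baseline, improved))


if __name__ == "__main__":
    for technique, row in CONTRIBUTIONS.items():
        print(f"{technique:24s} {harmonic_mean_mflops(row):6.2f} MFLOPS")
    base = CONTRIBUTIONS["scalar, unoptimized"]
    print("unroll vs. unoptimized:",
          [round(x, 2) for x in class_speedups(base, CONTRIBUTIONS["scalar only, unroll 8x"])])
    print("vector vs. unoptimized:",
          [round(x, 2) for x in class_speedups(base, CONTRIBUTIONS["vector only"])])
    print("vector vs. unroll:",
          [round(x, 2) for x in class_speedups(CONTRIBUTIONS["scalar only, unroll 8x"],
                                                CONTRIBUTIONS["vector only"])])
```

Read this way, the vector-only and unrolled-scalar rows give harmonic means near 10.51 and 11.15 MFLOPS, and the vectorizable class shows a speedup of roughly 7.5 from vectorization, about three times what unrolling alone achieves on the same kernels.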

In addition, the improvement in vector performance in this analysis is somewhat
limited by less mature vector-compiler technology and by the Cray-1S implementation, an
old vector processor by today’s standards with limited chaining capabilities and only one
memory port; hence even more improvement could be expected with modern implementations.
Finally, this data is further evidence that more parallelism is available in vectorizable
loops (loops with no self-dependent statements) and limited in non-vectorizable ones.

In summary, using superpipelined hardware for scalar program fragments improves
performance by a factor of 1.6 to 2 depending upon the program, while using vector hardware
improves vectorizable program fragments by a factor of about 8. To compare the
improvement in performance of an entire program when using vector-only hardware and
vector hardware combined with superpipelined hardware, I use the following variation of
Amdahl’s Law:

\[
\text{speedup} = \frac{1}{\dfrac{1 - f}{S} + \dfrac{f}{8}}
\]

where f is the fraction of vectorizable code executed by a program, and S is the speedup
provided by superpipelined hardware. Vector-only hardware has S = 1, while combined
hardware has S = 1.6 or S = 2 depending upon the program. The following table lists the
program speedups for a range of values for f and S.


/ 

5  =  1 

5=  1.6 

5  =  2 

0.0 

1.0 

1.6 

2.0 

(scalar-only  code) 

0.2 

1.2 

1.9 

2.3 

0.4 

1.5 

2.3 

2.8 

0.6 

2.1 

3.1 

3.6 

0.8 

3.3 

4.4 

5.0 

1.0 

8.0 

8.0 

8.0 

(vector-only  code) 

This table shows how effectively superpipelined hardware in combination with a vector
architecture dampens the negative effect of Amdahl’s Law when the percentage of vectorizable
code is less than 80%.
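
The speedup expression can be evaluated directly. The following is a minimal sketch, assuming that f is a fraction between 0 and 1 and that the vector speedup is fixed at 8 as in the text; the function name and the `vector_speedup` parameter are illustrative and do not come from the original analysis. The computed values agree with the table to within about 0.1.

```python
def program_speedup(f, s, vector_speedup=8.0):
    """Overall speedup for a program under the variation of Amdahl's Law.

    f              -- fraction (0..1) of execution that is vectorizable
    s              -- speedup of the scalar part from superpipelining
    vector_speedup -- speedup of the vectorizable part (about 8 in the text)
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie between 0 and 1")
    return 1.0 / ((1.0 - f) / s + f / vector_speedup)


if __name__ == "__main__":
    # Print one row per value of f, one column per value of S.
    for f in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
        row = [program_speedup(f, s) for s in (1.0, 1.6, 2.0)]
        print(f"{f:.1f}  " + "  ".join(f"{x:5.2f}" for x in row))
```

Setting S = 1 gives the vector-only column, so the benefit of adding superpipelined hardware at a given f is the ratio of two entries in the same row.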


3.3  Software  Advantages  of  Vector  Architectures 

In academia, an architecture is often judged only by its hardware and performance.
In practice, however, a commercially successful architecture also depends on other aspects
that are more software-oriented, such as ease of use and whether a program can execute
on implementations that differ in cost but implement the same architecture. Commercial
success depends on such issues because they affect how many people can use an architecture.
In this regard, a vector architecture holds advantages over a superscalar one, although these
advantages are not as easily quantifiable as the advantages in hardware and performance.

First, vector compilation technology is mature, having been in development since
before the announcement of the Cray-1 in 1976 [101]. Moreover, if a vector processor were
to be used in an MPP, mature compilers and a well-established user community that already
knows how to use such processors productively would allow researchers in the compiler and
applications community to concentrate on the more important issue of how to efficiently
distribute a workload across a large number of processors. By contrast, compilation
techniques for superscalar architectures are still in the research and development phase [9].
Using a superscalar processor in an MPP would carry the additional burden of developing
good compilers for the processor. Although superscalar compilation techniques may be able
to extract parallelism from non-vectorizable program fragments, it is unclear that there is
much parallelism to extract in such programs (as discussed in the previous section).

One area of concern about vectorization is that it typically takes longer than basic
scalar  compilation;  however,  this  will  also  be  true  for  superscalar  compilation.  A  major 
difference  between  vectorization  and  basic  scalar  compilation  is  that  the  former  includes  a 
dependence-analysis  phase  that  determines  what  operations  can  execute  in  parallel  without 
changing  the  functionality  of  the  program.  Because  dependence  analysis  must  be  part 
of  any  compiler  that  generates  instructions  to  execute  operations  in  parallel,  superscalar 
compilation  will  also  include  this  phase. 

A second advantage of vector architectures is that the concept of the vector instruction
is easily understood by a wide range of people, from high-level language programmers
to hardware designers. This conceptual simplicity reduces the chances of implementation
errors at the hardware and compiler levels. A simple abstraction model for expressing
parallelism will become increasingly important as systems with more parallelism become
available. Such simplicity is also advantageous for the end-user who must use not only the
computer but also the software that makes the computer easier to use [25]. If a compiler
is not yet able to produce vectorized code, the user could still resort to using assembly
language with vector instructions and still be able to achieve some amount of parallelism,
as did those who used the ETA computer at Florida State University [80]. This would be
more difficult to accomplish with a superscalar architecture.

Finally, a vector architecture can provide binary compatibility across different
hardware implementations with varying degrees of parallelism more or less as easily as a
superscalar architecture does, depending upon the type of compatibility. Binary compatibility
allows a program to execute without recompiling on a range of processor implementations
that vary in cost and performance. I consider binary compatibility to be a software
advantage for an architecture because it minimizes the impact that changes in the hardware
can have on existing compiler and application software. Binary compatibility is also a way
to amortize the cost and development of a big VLSI chip over a large consumer base by
increasing the potential market at both the high and low end of the cost/performance range.
There are, in fact, three types of binary compatibility to consider:

•  upward compatibility, which allows a program compiled for a vector architecture to
execute on vector processors with varying degrees of datapath parallelism;

•  scalar compatibility, which allows a program compiled for a vector architecture to
execute on a scalar processor; and

•  backward compatibility, which improves the performance of a program compiled for a
scalar architecture (typically one that already exists) when the program executes on
a vector processor.

Upward compatibility allows an architecture to quickly take advantage of improving
technology and to increase its cost/performance range at the higher end. A program
compiled with vector instructions can be executed on vector implementations with a varying
number of functional units because the mapping of a vector instruction to a particular
functional unit is part of the hardware implementation and not the instruction set architecture.
For architectures that support fine-grain parallelism, more transistors on a single chip, as
the result of improving VLSI technology, allow support for greater amounts of parallelism.
As I have already discussed in the first part of this section (when I compared the hardware
expense of a vector architecture with that of a superscalar one), a vector architecture is
not only upwardly compatible with increasing amounts of datapath parallelism, but it also
provides this capability at less cost than does a superscalar architecture.

Scalar compatibility increases the range of an architecture at the lower end of
the cost/performance scale. Because a scalar instruction is equivalent to its corresponding
vector instruction that executes one operation, scalar compatibility can be provided by
giving each vector register a single element, although good engineering is needed so
that such an implementation has acceptable performance for a vector length of one. In such
an implementation, the vector register file becomes in essence a second scalar register file.
Viewed this way, scalar compatibility in a vector architecture is easily provided if stripmining
is entirely supported by the hardware, as it is in the IBM 3090 vector architecture, in
which binary-compatible implementations can have different lengths of vector registers [ ].
Because scalar compatibility is an issue when cost is more important than performance, the
need  to  provide  a  scalar-compatible  implementation  will  lessen  as  larger  chips  become  less 
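
The idea that one program can run unchanged on implementations with different vector register lengths, including a length of one, can be sketched functionally. The following is only an illustration under stated assumptions: `max_vl` is a hypothetical parameter standing in for the hardware vector register length, the loop models stripmining performed below the level of the program, and nothing here reflects the actual IBM 3090 or Cray instruction sets.

```python
def vector_add(a, b, max_vl):
    """Add two equal-length lists in strips of at most max_vl elements.

    max_vl models the number of elements in a vector register
    (hypothetical parameter; max_vl == 1 models a scalar-compatible
    implementation).
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    if max_vl < 1:
        raise ValueError("max_vl must be at least 1")
    result = []
    for start in range(0, len(a), max_vl):
        # Vector length used for this strip; the last strip may be short.
        vl = min(max_vl, len(a) - start)
        # One "vector instruction" that performs vl additions.
        strip_a = a[start:start + vl]
        strip_b = b[start:start + vl]
        result.extend(x + y for x, y in zip(strip_a, strip_b))
    return result


if __name__ == "__main__":
    a = list(range(100))
    b = list(range(100, 200))
    # The same "program" gives the same answer for any register length.
    print(vector_add(a, b, 64) == vector_add(a, b, 1))
```

Only the number of strips changes with `max_vl`; the result does not, which is the property that upward and scalar compatibility rely on.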

Backward compatibility allows a new implementation to improve the performance
of so-called “dusty-deck” programs that have been compiled for a scalar architecture. The
motivating factor for providing backward compatibility is to maintain the market share of
an already existing architecture that has a large software base that is not likely to be
recompiled. To accomplish the same effect as recompilation, which rearranges the execution
order of operations to allow parallelism to occur, backward-compatible hardware uses
dynamic scheduling, also known as out-of-order instruction-issuing. To find instructions that
can overlap in execution, hardware for dynamic scheduling must perform, in each clock
period, pairwise checks for dependences among several instructions, in a manner similar to
what is done when issuing instructions in a superscalar architecture. However, dynamic
scheduling for backward compatibility requires instruction-issue logic that is more complex
than that of a superscalar architecture because, to find enough instructions to issue without
recompilation, hardware for dynamic scheduling must examine more than the number of
instructions that are actually issued per clock period.

In fact, Wall’s study on parallelism provides data showing that hardware for dynamically
extracting parallelism is extremely expensive for a relatively small gain in performance.
Wall’s data, given in more detail in Figure 3.13, shows that a constant increase
in the number of instructions issued per clock period requires an exponential growth in the
number of instructions examined. For example, to issue an average of 3 instructions per
clock period, an average of 4 instructions must be examined per clock period. Doubling the
number of examined instructions to 8 produces, on average, only 3 to 5.5 instructions that
can issue each clock period. Continuing to double the number of examined instructions
finds, at best, one more instruction to issue for programs with little intrinsic parallelism
(< 8) and 3 to 11 more instructions for programs rich in parallelism. The hardware expense
of dynamic scheduling is actually in the hazard checks made for each possible pair of
examined instructions. Hence, because the number of pair-wise hazard checks grows as
the square of the number of examined instructions, a constant increase in parallelism
requires an exponentially-squared increase in the number of pair-wise hazard checks.
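
The quadratic growth in hazard checks is easy to tabulate. The sketch below assumes that each unordered pair of examined instructions requires exactly one check, which gives n(n - 1)/2 checks for n instructions; real issue logic may perform several kinds of hazard check per pair, so the counts are a lower bound under that assumption and the function name is illustrative only.

```python
def pairwise_hazard_checks(examined):
    """Number of unordered instruction pairs among `examined` instructions."""
    if examined < 0:
        raise ValueError("the number of examined instructions cannot be negative")
    return examined * (examined - 1) // 2


if __name__ == "__main__":
    # Each doubling of the examined instructions roughly quadruples the checks.
    for n in (4, 8, 16, 32, 64):
        print(n, pairwise_hazard_checks(n))
```

Doubling the number of examined instructions from 4 to 8 raises the count from 6 pairs to 28, which is why each additional issued instruction becomes progressively more expensive.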

Unless the market share for dusty-deck programs is an overriding concern, providing
backward compatibility hardly seems like a good [...]
architecture even for programs whose intrinsic parallelism is plentiful. If the main purpose
of an architecture is to support fine-grain parallelism, it is best to [...]
architecture, thus making backward compatibility less of an issue. This is, in fact, what
most commercial superscalar implementations have done, [...]
Sun’s SuperSPARC and possibly a future implementation of the Intel i386,
that clearly have a high investment in market share. However, if absolutely necessary, the
superpipelined extension of a combined vector and superpipelined architecture could provide
backward compatibility, although the simplicity of the instruction-issue logic would be
gone because of the many hazard checks required for dynamic scheduling.


3.4  Summary 

In this chapter, I presented arguments for why a vector architecture combined
with superpipelined hardware is more appropriate for supporting fine-grain parallelism than
is a superscalar architecture with respect to hardware, performance, and software.
Although either architecture could be implemented on a single VLSI chip, much work is
focused on superscalar architectures but little attention is being paid to vector architectures
as a viable VLSI design. This is because many designers, in part, mistakenly believe that a
vector processor is expensive to implement and is effective for only a small set of programs.

I presented data showing that, in fact, when supporting the equivalent amount
of datapath parallelism, a vector architecture is no more expensive than a superscalar one
and, for some features, is even less costly. One feature that is needed by either architecture
is a high-bandwidth memory system because both have a high memory demand. A high-performance
cache system could be used for a vector processor as an
alternative to an expensive, large, highly-interleaved memory, although further research
is needed to determine an appropriate cache organization. Another difference
in cost is the register file. Although vector architectures typically use many more registers
than do superscalar ones, the area increase of a vector register file [...]
increase in number. In fact, doubling the number of [...], which [...]
data suggests will be necessary, makes a superscalar register file with 5 read ports and
[...] write ports comparable in area to a vector register file similar to the one in the Cray Y-MP
[...] registers. A vector architecture also has simpler instruction-issue logic. Finally,
[...] in cost favor the vector architecture [...]
about 26 times as many instructions could be executed if the hardware [...]
use of the intrinsic parallelism in Wall’s workload. To lessen the effects of Amdahl’s Law,
I showed that superpipelined hardware is effective at handling non-vectorizable program
fragments and that additional vector hardware provides three times more performance [...]
vectorizable [...]

Figure 3.13: Number of Instructions Issued versus Number of Instructions Examined

Based on Wall’s parallelism data [115], this graph shows the number of instructions
examined and the number of hazard checks that are required to issue
a given number of independent instructions [...]

I also discussed some of the software advantages of a vector architecture: mature
compiler technology, the vector instruction as a simple, elegant abstraction for
expressing parallelism, and binary compatibility of
a vector architecture across a range of implementations. These software advantages are
overlooked by academics, in part, because their effects are difficult to quantify, but they
are important to the commercial success of an architecture because they affect the number
of people that can use an architecture.

In summary, although superscalar architectures appear to be the current design
of choice, I believe that a vector architecture is more suitable in computers, such as
workstations and massively parallel processors, that rely heavily on VLSI technology. Vector
architectures work effectively for vectorizable programs, which are rich in parallelism,
while non-vectorizable programs appear to contain meager amounts of parallelism.
If superscalar architectures are to perform as well as vector ones, contrary to popular belief,
the hardware implementation of a superscalar architecture can be more expensive than that
of a vector one. In other words, if the main purpose of an architecture is to support
fine-grain parallelism, a vector architecture is a better choice than a superscalar one because
of the simplicity of a vector architecture’s hardware, its natural match to programs rich in
parallelism, and its established compiler and application communities.

Chapter 4

Common  Experimental 
Framework 


This short chapter describes the common experimental framework (the basic
vector hardware, the performance tools, and the workload) that I use in the following two
chapters,  in  which  I  evaluate  the  performance  impact  of  changes  in  a  vectorizing  compiler 
and  in  vector  hardware.  Other  aspects  of  the  experimental  framework,  such  as  performance 
criteria  and  methodology,  differ  for  the  studies  I  carry  out  and,  hence,  are  described  in  the 
chapter  for  their  respective  study. 


4.1  Processor  Description 

The hardware basis for my dissertation is the processor of the Cray Y-MP, which
was first announced in 1988. A fully-configured Y-MP computer contains eight processors;
the “MP” in the name stands for “multiprocessor.” The processor itself is a load/store,
superpipelined, vector architecture. The deep pipelines plus the use of rather expensive
bipolar technology result in an extremely high clock frequency: 167 MHz, or equivalently,
a 6 ns clock period. As a point of reference, in 1991, most microprocessors have a clock
frequency between 25 and 40 MHz, with the higher-performance ones having 63 MHz (the
Hewlett-Packard Snake) and 100 MHz clocks (the MIPS R4000). Following are details
on the organization of the registers and functional units that are relevant to my thesis.
Other details about the Y-MP processor are available in the Cray Y-MP Computer Systems
Function Description Manual [24].

Figure 4.1 lists the register files in the Y-MP processor. The vector register file,
which is the focus of this thesis, can be viewed as a partitioned one (see Section 2.2.2)
in which each vector register is comparable in organization to a scalar register file. A
vector register consists of 64 dual-ported registers that are attached to read and write buses
common to the vector register. Because of the separate read and write buses, chaining is
possible between any vector instructions. For my thesis, I examine different configurations
of the vector register file. In Chapter 5, Register Usage and Instruction Scheduling, I
experiment with the number of vector registers, and in Chapter 6, Bus Usage and Register
Assignment, I explore the implications of having more than one vector register share a set

REGISTER                               TOTAL NUMBER
FILE       WIDTH     ORGANIZATION      OF BYTES       FUNCTION
A          32 bits   8 registers       32 bytes       store addresses or integer data
B          32 bits   64 registers      256 bytes      back-up for A register file
S          64 bits   8 registers       64 bytes       store integer/FP scalar data
T          64 bits   64 registers      512 bytes      back-up for S register file
V          64 bits   8x64 registers    4096 bytes     store integer/FP vector data

Figure  4.1:  Register  Files  of  the  Cray  Y-MP  Processor 

This table shows how the five register files of the Y-MP processor vary in size, organization,
and functionality. Note that data and address words differ in size: data are 64 bits and addresses
are 32 bits. Because of the limited capacity of the A and S register files, the back-up register
files, B and T respectively, serve as temporary storage that is faster to access than main memory.
All the register files are connected to memory ports, which are functional units that serve as the
interface between the Y-MP processor and its memory system. The register files A, S, and V are
also connected to the other functional units, while the back-up ones are not.


of  read  and  write  buses. 

There are also individual registers for special purposes. The vector length register
VL specifies the number of operations that a vector instruction is to execute. The maximum
number is 64, which is the number of elements in a vector register. The vector mask register
VM is a 64-bit register that is set and used by special vector instructions for conditional
selection of data. For example, the instruction V0<-VM?V1:V2 (written with C-like syntax)
transfers the i-th element of V1 to the i-th element of V0 if the i-th bit of VM is equal to 1;
otherwise the i-th element of V2 is transferred. The VM register and its associated instructions
permit some loops with conditional statements to be vectorized.
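
The conditional selection described above can be written out element by element. This is a minimal functional sketch, assuming the mask is represented as a list of 0/1 values so that no particular bit-numbering convention for VM has to be assumed; the function names are illustrative and are not Cray mnemonics.

```python
def vector_select(vm, v1, v2):
    """Model of V0 <- VM ? V1 : V2 on lists.

    vm is a list of mask bits (0 or 1); element i of the result comes
    from v1 where vm[i] == 1 and from v2 otherwise.
    """
    if not (len(vm) == len(v1) == len(v2)):
        raise ValueError("mask and operands must have the same length")
    return [a if bit == 1 else b for bit, a, b in zip(vm, v1, v2)]


def vectorized_abs(xs):
    """Vectorized form of: IF (X(I) .LT. 0) THEN Y(I) = -X(I) ELSE Y(I) = X(I)."""
    mask = [1 if x < 0 else 0 for x in xs]   # vector compare sets the mask
    negated = [-x for x in xs]               # computed for every element
    return vector_select(mask, negated, xs)  # the mask picks the right value


if __name__ == "__main__":
    print(vectorized_abs([3, -1, 0, -7]))
```

Both branches of the conditional are evaluated for every element and the mask selects between them, which is what allows a loop containing an IF statement to run as straight-line vector code.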

The  Y-MP  processor  has  nine  special-purpose  functional  units: 

•  two  load  ports 

•  a  store  port 

•  a  floating-point  adder 

•  a  floating-point  multiplier 

•  a  floating-point  reciprocal  unit 

•  an  integer  unit 

•  a  logical  unit 

•  a  shifter 

The logical unit is used in conjunction with the VM register for conditional selection. A
second logical unit is optional but, because the simulator does not model this, I choose to
ignore it. Division is computed using the reciprocal unit and the multiplier in four steps
(one reciprocal approximation and three multiplications).
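
One way to obtain a quotient from one reciprocal approximation and three multiplications is a single Newton-Raphson refinement step. The sketch below is an assumption about how the four steps might fit together, not a description of the Y-MP instruction sequence: `reciprocal_approx` is a hypothetical stand-in that deliberately returns a low-precision reciprocal, and the subtraction from 2 is assumed to be folded into the second step.

```python
def reciprocal_approx(b):
    """Hypothetical stand-in for a hardware reciprocal approximation.

    Returns 1/b with a deliberate relative error of about 1e-4.
    """
    return (1.0 / b) * (1.0 + 1.0e-4)


def divide(a, b):
    """Compute a / b with one approximation and three multiplications."""
    if b == 0:
        raise ZeroDivisionError("division by zero")
    r0 = reciprocal_approx(b)      # step 1: reciprocal approximation
    correction = 2.0 - b * r0      # step 2: multiplication (with 2 - x folded in)
    r1 = r0 * correction           # step 3: multiplication, refined reciprocal
    return a * r1                  # step 4: multiplication, the quotient


if __name__ == "__main__":
    print(divide(355.0, 113.0), 355.0 / 113.0)
```

The refinement squares the relative error of the approximation, so an estimate that is good to roughly four decimal digits yields a quotient good to roughly eight.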

The Y-MP processor is rich in memory bandwidth with a total of three memory
ports, whereas most processors have only one bi-directional memory port. Furthermore,
the Y-MP processor has gather/scatter hardware that provides indirect
addressing to allow nested array references to be vectorized. For example, to load the data
specified by the array reference A(I(K)) into vector register V1, only two vector memory
instructions are executed:

A0 <- base address of array I()

A1 <- base address of array A()

V0 <- M[ A0 ]

V1 <- M[ A1+V0 ]

To compute the effective addresses for the second memory instruction, the hardware
adds the elements in the vector register V0 to the base address in A1. Gather
refers to memory loads that use a vector register as an index register whereas scatter refers
to stores.

The functional units are fully pipelined so that a new operation can begin executing
every clock period in each functional unit. Pipelined memory accesses are provided
by an interleaved memory system. However, because individual memory banks are not
pipelined and have an access time that is greater than one clock period,
references to the same bank take longer to execute when they occur in quick
succession. Such access conflicts to a memory bank cause execution delays in the vector
memory instructions that generated the references, preventing these instructions and any
vector instructions chained to them from achieving full pipelined execution.
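
The effect of bank conflicts within a single reference stream can be illustrated with a small timing model. Everything numeric here is hypothetical: the number of banks and the bank busy time are parameters chosen for illustration, not Y-MP values, and the model assumes at most one reference issues per clock period and that address modulo the number of banks selects the bank.

```python
def stream_cycles(addresses, num_banks, bank_busy):
    """Clock periods needed to issue one reference stream.

    One reference may issue per clock period, but a bank stays busy for
    bank_busy periods after it is accessed and delays any later reference
    that maps to it.
    """
    free_at = [0] * num_banks   # earliest period each bank accepts a reference
    cycle = 0                   # period in which the next reference may issue
    for addr in addresses:
        bank = addr % num_banks
        issue = max(cycle, free_at[bank])   # stall while the bank is busy
        free_at[bank] = issue + bank_busy
        cycle = issue + 1
    return cycle


if __name__ == "__main__":
    unit_stride = [i for i in range(64)]
    bad_stride = [16 * i for i in range(64)]
    print(stream_cycles(unit_stride, num_banks=16, bank_busy=4))
    print(stream_cycles(bad_stride, num_banks=16, bank_busy=4))
```

With these illustrative parameters a unit-stride stream never revisits a busy bank, while a stride equal to the number of banks sends every reference to the same bank and serializes the stream.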

Although functional units can execute simultaneously, there are some restrictions
on the simultaneous use of the memory ports when using gather or scatter instructions.
Even though a gather instruction and a scatter one use different memory ports, only one of
these can occur at a time. However, a gather can occur in conjunction with a simple load
or store, and a scatter can execute in parallel with one or two loads.
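
The two-instruction sequence for the nested reference A(I(K)) shown earlier in this section can be mimicked on a flat list that stands in for memory. This is a functional sketch only: the function names are illustrative, indices are zero-based here whereas FORTRAN arrays are one-based, and no timing or port restrictions are modeled.

```python
def vector_load(memory, base, vl):
    """Model of V0 <- M[ A0 ]: load vl consecutive words starting at base."""
    return [memory[base + i] for i in range(vl)]


def gather(memory, base, index_vector):
    """Model of V1 <- M[ A1+V0 ]: load the word at base + each index."""
    return [memory[base + idx] for idx in index_vector]


def scatter(memory, base, index_vector, values):
    """Model of M[ A1+V0 ] <- V1: store each value at base + its index."""
    for idx, value in zip(index_vector, values):
        memory[base + idx] = value


if __name__ == "__main__":
    memory = [0] * 32
    memory[0:3] = [2, 0, 3]                 # index array I at base address 0
    memory[10:14] = [100, 101, 102, 103]    # data array A at base address 10
    v0 = vector_load(memory, 0, 3)          # V0 <- M[ A0 ]
    v1 = gather(memory, 10, v0)             # V1 <- M[ A1+V0 ]
    print(v1)
```

The first instruction is an ordinary vector load of the index array, and only the second uses a vector register to supply the offsets.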


4.2  Performance  Tools 

To generate the raw data used for my performance studies, I use modified versions
of Cray Research’s production FORTRAN compiler, which is named cft77, and the Cray
Y-MP simulator. Figure 4.2 illustrates the relationship of these two tools. Both tools are
parameterized so that the number of vector registers can vary up to a maximum of 64.

The cft77 compiler is a vectorizing one, and performs global and local optimizations
on both the scalar and vector code that it generates. I am specifically interested in the
instruction scheduling and register assignment phases. An instruction scheduler determines
an appropriate order in which operations in a dependence graph execute, given the
dependences among the operations, and a register assigner determines which register stores
the value produced by an operation. In the cft77 compiler, the scheduling phase occurs
before the assignment phase, a sequence that I assume when describing algorithms for these


Figure 4.2: Performance Tools

This figure shows how I use Cray Research’s vectorizing compiler, called cft77, and Y-MP
simulator in my performance studies. Both tools have been modified so that I can specify the number
of vector registers to use, up to a maximum of 64. The instruction scheduler and register assigner
are highlighted because these are the phases that I will concentrate on in this dissertation. Although
not explicitly shown, inputs to earlier phases are also inputs to later phases. In other words, input
to a phase is augmented with more information, all of which is passed on to the next phase.

two phases. Consequently, input to the instruction scheduler is a dependence graph, which
is generated by the dependence analyzer, and input to the register assigner is an execution
order for that dependence graph.

The simulator emulates every aspect of a Y-MP processor and can keep track of
simulated execution time. The behavior of the instruction buffer is accurately modeled.
Only simple memory-bank conflicts, such as those occurring within one stream, are taken
into account. Memory conflicts between two independent reference streams are ignored.

4.3  Workload 

In addition to performance tools, an appropriate workload is needed for my performance
studies. Because my research experiments explore aspects of vector design, I use
a set of 36 vectorizable loops that was collected at Cray Research and is used
by their architects to evaluate future designs. These loops, which I collectively call the
“CRI workload,” have been extracted from actual applications used by Cray customers and
are written in FORTRAN. They include kernel 7 and the second loop of kernel 18 from the
Livermore Loops [84] plus several loops extracted from the Perfect Club benchmark suite
[12]. In addition, these loops contain several program constructs that are traditionally
considered difficult to vectorize. Examples are scalar reductions, array references with nested
indices, conditional statements, and calls to intrinsic functions.

Seven of the 36 loops in this workload consist of more than one basic block. One
reason for this is that a loop containing a conditional IF statement has
at least three basic blocks. For example, one loop contains nine IF statements,
resulting in 27 basic blocks. Vectorizing scalar reductions also produces multiple basic
blocks: one for computing a vector of partial sums and another to calculate the final sum
(Section 2.3.1 describes the software transformation for computing a scalar reduction using
vector instructions).
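
The two basic blocks produced by vectorizing a scalar reduction can be sketched as follows. This is a generic illustration of the partial-sums idea, not a reproduction of the transformation in Section 2.3.1; the function name and the default vector length of 64 are assumptions made for the example.

```python
def vectorized_sum(values, vl=64):
    """Sum a list using a vector of partial sums followed by a final sum."""
    if vl < 1:
        raise ValueError("vl must be at least 1")
    # First basic block: accumulate a vector of vl partial sums.
    partial = [0.0] * vl
    for start in range(0, len(values), vl):
        strip = values[start:start + vl]
        for i, x in enumerate(strip):   # models one vector add per strip
            partial[i] += x
    # Second basic block: reduce the partial sums to the final scalar.
    total = 0.0
    for p in partial:
        total += p
    return total


if __name__ == "__main__":
    print(vectorized_sum(list(range(1, 201))))
```

Because the additions are performed in a different order than in the original scalar loop, floating-point results can differ in the last bits from the sequential sum.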

In the cft77 compiler, each basic block is represented by a dependence graph
in which a vertex is a vectorizable operation and there is an arc from one operation to
another if the first operation produces a result used by the second one. A vectorizable
operation is eventually converted into a vector instruction in the code-generation phase of
the compiler. The vectorizable operations in this workload comprise a mixture of floating-point
and integer operations. Because some loops contain more than one basic block, there
are in fact 88 dependence graphs for the 36 loops.

An important characteristic of these loops is the substantial variance in the number
of vectorizable operations, which, in turn, results in a wide range in the execution times
for one iteration. Figure 4.3 illustrates this diversity. Two-thirds of the loops contain more
than 30 vectorizable operations, and the execution time for one iteration ranges from 2 clock
periods to about 300 clock periods. This considerable variance is important because, in all
likelihood, the larger loops will have more parallelism and hence require more registers. If
the workload consisted of loops with fewer than 30 operations, I could erroneously conclude
that eight vector registers are sufficient for the Cray Y-MP functional unit configuration.

Figure 4.3 also shows that, with the exception of one loop, more than one vectorizable
operation is executed per clock period, demonstrating that a vector architecture


Figure 4.3: Vectorizable Operations and Execution Time of the CRI Workload
(x-axis: kernels sorted by average execution time per iteration)

[...] Neither the extra instructions nor the scalar instructions are included in the
count of vectorizable operations. The execution time plotted is the average time to execute
one iteration of a loop compiled by cft77, Cray Research’s vectorizing FORTRAN compiler,
for the Cray Y-MP. The average time per iteration, which is calculated as the time to execute
the entire loop divided by the number of iterations executed, includes any time spent executing
[...] instructions, strip overhead, or intrinsic functions.

does make use of fine-grain parallelism. The average-per-iteration time for the one loop
(number 16 in the graph) includes computing a square root whose instructions are excluded
from the count of operations. In fact, because the average-per-iteration times for all the
loops include the execution time for instructions other than the vectorizable operations,
the amount of parallelism capitalized on by the Y-MP vector architecture is greater than
what is illustrated in this graph.


4.4  Summary 

In summary, with the cooperation of Cray Research Incorporated, I have access
to benchmarks, a production vectorizing compiler, and a simulator. In the following two
chapters, I will modify the Cray compiler and use the benchmarks and simulator to evaluate
the performance impact of changes in the Cray Y-MP vector processor and compiler.

Chapter 5

Register  Usage  and 
Instruction  Scheduling 


In Section 2.2 (of Chapter 2, Fundamentals of Vector Architectures), I outlined the
general hardware requirements needed for supporting fine-grain parallelism. In particular,
I stated that both multiple functional units and an appropriate organization for a register
file are equally important for allowing parallelism to occur. In support of this statement,
I presented data from an independent study showing that increasing the number of registers
from 32 to 512 increases the amount of achievable parallelism, which results in [...]
utilization of 64 functional units (see Section 3.2 of Chapter 3, A Case for Vector
Architectures). Hence, the number of registers must be balanced with the number of functional
units if enough parallelism is to occur to use the hardware efficiently. If there are too few
registers relative to the number of functional units, the functional units will not be used to
their fullest potential. If there are too many registers, the functional units will be effectively
used but the register file will be over-designed. In other words, a hardware designer needs
to know the minimum number of registers required to use a given number of functional
units effectively across a range of programs. Determining this number requires a study that
examines both the performance and cost of implementing a given number of registers.

In  this  chapter,  I  focus  primarily  on  the  performance  aspect  of  implementing  vec¬ 
tor  registers  in  the  Cray  Y-MP  vector  architecture,  and  defer  the  cost  analysis  to  the  n«t 
chapter.  There  are  8  vector  registers  connected  to  9  special-purpose  ffinctional  uiu  s  m 
Y-MP  processor.  Because  I  want  both  of  these  components  to  be  well  utilized,  I  begin  iny 
investigation  by  asking  “Would  more  vector  registers  significantly  improve  performance, 
and  if  so,  “How  many  more  vector  registers  are  needed  before  performance  no  longer  im- 
oroves’”  These  questions  form  the  primary  goal  of  this  study,  which  is  to  determine  e 
minimum  number  of  vector  registers  that  can  effectively  use  the  9  special-purpose  function^ 
units  in  the  Cray  Y-MP  vector  processor.  To  produce  the  desired  results,  I  need  to  also 
demonstrate  that  the  instruction-scheduling  algorithm,  which  is  used  in  the  code-generation 
phase  of  a  compiler,  has  a  major  impact  on  the  performance  of  the  generated  code.  Tffis 
secondary  goal  was  not  obvious  at  the  outset  of  this  study  and  became  apparent  only  after 

I  had  analyzed  some  preliminary  performance  data. 

For this study, I vary only the number of vector registers, leaving the maximal
vector length, which is the number of elements per vector register, unchanged. Vector length,
which is the amount of unrolling provided by stripmined code (described in Section 2.3.2,
pages 24 to 29), is used to hide both the latency of vectorizable operations and the execution
of scalar operations, an effect that is more dominant when the number of iterations executed
for a loop is greater than the maximal vector length. A maximal vector length of 64 is
sufficiently long to effectively hide the latencies of the operations implemented in the Cray Y-MP.
Shortening vector length to be less than 64 is likely to either not affect performance or even
degrade it because operational latencies can no longer be effectively overlapped with the
execution of other operations. Increasing vector length beyond 64 is unlikely to improve
performance. Furthermore, in order to be effective, a longer vector length would require
more iterations to be executed for a loop, a factor which is determined by an application
rather than a compiler. On the other hand, increasing the number of vector registers
does improve performance significantly, as I will demonstrate in this chapter. Moreover,
this performance improvement, although somewhat dependent upon characteristics of an
application, is also influenced by a compiler’s algorithm for scheduling instructions.

The version of the cft77 compiler used for this chapter is the one that was available
to me during my work term at Cray Research, Incorporated in 1990. Since that
time a newer version of cft77 has been released that uses a scheduling algorithm similar
to the one I developed [62]. Nonetheless, for the sake of brevity, I use the term cft77
interchangeably with the phrase “the 1990 version of cft77.”

To explain how I formulated the goals for this investigation, I first show how more
registers can improve performance and present some initial performance data that suggest a
scheduling algorithm different from the one used in the cft77 compiler is needed to use
registers effectively. Then, after describing the performance criteria and methodology I use
to perform my experiments, I contrast the scheduling algorithm used in the cft77 compiler
with an algorithm that I developed and that is a variant of list scheduling. Finally, I present
a set of performance data showing that the scheduling algorithm does have a major impact
on performance and another set of data that determines a cost-effective number of registers
for the Cray Y-MP vector processor.

5.1  More  Registers  and  A  Different  Scheduling  Algorithm 

In this section, I show how the number of vector registers and the scheduling
algorithm affect performance. First, I work through an example to show how more registers
can improve performance and use this example as the inspiration for the primary goal of this
study. Additionally, this example demonstrates how determining the appropriate balance
among vector hardware components also depends on two aspects of the code-generation phase
of a compiler: the instruction scheduler and the register assigner. In the second part of this
section, I present an initial performance study whose apparently conflicting results inspire
the secondary goal of this study.

*Because the scheduling algorithms described in this chapter can treat vector registers and scalar
registers analogously, I use the terms vector register and register interchangeably.

5.1.1  Why  More  Registers?

Because using more registers requires additional hardware, it must be justified by a
significant reduction in execution time. Using more registers reduces execution time in two
ways. First, more registers allow more aggressive scheduling which, in turn, allows more
parallelism to occur. More parallelism causes more intermediate results to be generated at
any one time and hence, more registers are needed to hold these intermediate results.
Reducing execution time in this way also allows functional units to be used more frequently.

Second, more registers reduce the need for register spills and, hence, the extra instructions
that spilling requires. In a vector architecture, these extra instructions are, in fact, vector
loads and stores that save and restore the contents of vector registers. However, in contrast
to a scalar architecture, the time to execute these extra instructions can be overlapped with
the execution of other instructions. Reducing register spills also lessens the demand on
memory, but this is less of a concern when an abundance of memory bandwidth is provided
in the implementation, as is the case for most vector architectures.

In short, reducing the need for register spills has minimal impact on execution
time in vector architectures, and the better reason for adding more registers is to allow
more parallelism to occur. Later in this chapter, I present data that supports both of these
claims.

To demonstrate how register usage affects how much parallelism can occur, I use the
vectorizable loop in Figure 5.1, which is represented by the dependence graph in that figure.
Once the vectorizable operations of a loop are identified, a vectorizing compiler determines
an order in which these operations can execute. Although such an order corresponds to an
instruction sequence, I use the term execution order to emphasize that these are still
operations that are eventually translated into instructions. (I prefer the term execution
order to evaluation order, which is more commonly used by the compiler community [2],
because hardware does not evaluate operations; it executes them.) Any execution order is
permissible as long as the loop’s functionality does not change, which is guaranteed if the
data dependences among the vectorizable operations are preserved. In other words, a correct
execution order is one in which all ancestors of an operation in a dependence graph are
executed first.
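
The rule that every operation must follow all of its ancestors can be checked mechanically. The Python sketch below is an illustration only and is not part of the original study. The operation names (LOAD_W1, ADD_RA, and so on) are hypothetical labels for the operations of the sample loop in Figure 5.1, not the subscripted names used in the figure, and the multiplications are assumed to be grouped from left to right.

```python
# Dependence graph of the sample loop: each operation maps to the operations
# whose results it consumes. All names are hypothetical labels for this sketch.
DEPS = {
    "LOAD_W1": [], "LOAD_X1": [], "ADD_RA": ["LOAD_W1", "LOAD_X1"],
    "LOAD_W2": [], "LOAD_X2": [], "ADD_RB": ["LOAD_W2", "LOAD_X2"],
    "LOAD_W3": [], "LOAD_X3": [], "ADD_RC": ["LOAD_W3", "LOAD_X3"],
    "LOAD_SX": [],
    "MUL_1": ["LOAD_SX", "ADD_RA"],
    "MUL_2": ["MUL_1", "ADD_RB"],
    "MUL_3": ["MUL_2", "ADD_RC"],
    "STORE_FS": ["MUL_3"],
}


def is_correct_order(order, deps=DEPS):
    """Return True if `order` contains every operation exactly once and
    every operation appears after all operations it depends on."""
    if sorted(order) != sorted(deps):
        return False
    seen = set()
    for op in order:
        # A producer that has not been executed yet violates a dependence.
        if any(producer not in seen for producer in deps[op]):
            return False
        seen.add(op)
    return True


# The order in which the operations appear in the source code.
SOURCE_ORDER = [
    "LOAD_W1", "LOAD_X1", "ADD_RA",
    "LOAD_W2", "LOAD_X2", "ADD_RB",
    "LOAD_W3", "LOAD_X3", "ADD_RC",
    "LOAD_SX", "MUL_1", "MUL_2", "MUL_3", "STORE_FS",
]

# All seven loads first, then the arithmetic and the store.
LOADS_FIRST = [op for op in SOURCE_ORDER if op.startswith("LOAD")] + [
    op for op in SOURCE_ORDER if not op.startswith("LOAD")
]

print(is_correct_order(SOURCE_ORDER))        # True
print(is_correct_order(LOADS_FIRST))         # True
print(is_correct_order(SOURCE_ORDER[::-1]))  # False
```

Checking only the direct producers of each operation is sufficient: if every operation follows its direct producers, it also follows all of its ancestors.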

Although a dependence graph of a loop specifies a partial order for correct
functionality, an enormous number of execution orders satisfy that partial order. To take a
somewhat trivial example, the first seven operations of a correct execution order for the
dependence graph in Figure 5.1 could be the seven loads. Because these loads do not depend
on any other operations, they can be executed in any order, which yields 7! =
5040 different correct execution orders for the first 7 operations alone. All of these different
execution orders are exactly equivalent in functionality because the dependence graph, in
addition to specifying a partial order, also specifies how the results of the operations are

      DO 40 I=1,N
         RA = W(I,1)+X(I,1)
         RB = W(I,2)+X(I,2)
         RC = W(I,3)+X(I,3)
         FS(I) = SX(I)*RA*RB*RC
   40 CONTINUE

Figure 5.1: Source Code and Dependence Graph for Sample Loop

This figure presents the vectorizable loop that I use extensively in this chapter and in the next
one to motivate the studies I perform. This loop is a modified version of one from the CRI workload.
On the left is the FORTRAN source code and on the right are the vectorizable
operations for each FORTRAN statement. Each operation is identified by the type of operation it
executes and a unique number as a subscript to the operational type. At the bottom is a dependence
graph that shows the data dependences among the vectorizable operations. A vertex in such a
graph is a vectorizable operation, and an arc is a dependence where the direction of the arc is from
the producer of the dependent value to the consumer of the value. In Chapter 2, Fundamentals of
Vector Architectures, I discuss data dependences in greater detail and also describe how a vectorizing
compiler identifies vectorizable operations and constructs a dependence graph.

to be combined. In contrast, Figure 5.2 shows two dependence graphs that are, in theory,
functionally equivalent but which combine the results of operations in different orders. This
figure also explains why, in practice, such dependence graphs are not exactly equivalent.
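
One reason that regrouped multiplications are not exactly equivalent in practice is that floating-point multiplication is not associative. The following Python sketch is an illustration under assumed input values: the four numbers are hypothetical and were chosen to force an overflow in one grouping, which is an extreme case of the small rounding differences that regrouping usually causes. It evaluates the two groupings of Figure 5.2 for a single iteration.

```python
import math

# Hypothetical values for one iteration of the sample loop.
sx, ra, rb, rc = 1e100, 1e100, 1e200, 1e-200

# Grouping from left to right, as in the source code without parentheses.
left_to_right = ((sx * ra) * rb) * rc

# Regrouped so that two multiplications are independent of each other.
regrouped = (sx * ra) * (rb * rc)

print(math.isinf(left_to_right))  # True: the intermediate product overflows
print(math.isfinite(regrouped))   # True: rb * rc is close to 1
```

The left-to-right grouping overflows because its second product exceeds the largest representable double-precision value, whereas the regrouped form never produces such a large intermediate result.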

There  are  two  main  differences  between  all  the  correct  execution  orders  for  a 
dependence  graph: 


1.  the  time  needed  to  execute  the  order,  and 

2.  the  minimum  number  of  registers  needed  to  execute  the  order  without  having  to  spill 
registers. 

For example, Figure 5.3 shows two correct execution orders for the dependence graph in
Figure 5.1. Because all the dependence arrows point downward, both these orders satisfy
the partial order specified by the dependence graph. Nonetheless, these two orders differ in
their execution times and their minimal register requirements.

To demonstrate that the execution times of the two orders in Figure 5.3 are different,
I use a technique called chime counting to provide a quick estimate of execution time.
A chime, which originally was an abbreviation for “chain time”,* is a unit of time that is
approximately equal to the time it takes to execute one vector instruction. Instructions
that use different functional units can execute in the same chime in the absence of any
access conflicts among registers. For example, for the execution orders in Figure 5.3 the
first two loads and the addition execute in the same chime because the Cray Y-MP has two
load ports and chaining hardware. Conversely, instructions that use the same functional
unit must execute in different chimes. For example, the fourth operation (LOAD4) must
execute in the second chime because the load ports are already each executing a vector load
instruction.

Continuing in this fashion, we see that the execution order on the left executes in
6 chimes, and the one on the right executes in 4 chimes. Because each operation of a vector
instruction corresponds to an iteration of a loop (see Section 2.3 in Chapter 2), executing a
loop with vector instructions in t chimes corresponds to executing one iteration of that loop
in approximately t clock periods. Chime counting, in fact, provides an optimistic estimate of
the per-iteration execution time because the temporal impact of loop- and strip-overhead is
ignored. To verify that this estimate is reasonable, I executed these two orders on a Cray
Y-MP for 100 iterations each; the per-iteration time of the order on the left is 7.5 clock periods,
and that on the right is 5.7 clock periods. This is a difference of about 2 clock periods, as
predicted by chime counting.
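
Chime counting can be sketched as a small greedy procedure. The Python code below is a simplified illustration and not the method used to produce the measurements above. It assumes that chaining lets dependent operations share a chime, that operations issue in the listed order, and that a new chime starts whenever the next operation needs a functional unit that is already fully occupied. The operation names are hypothetical labels, and the second order is a hypothetical interleaving constructed for this sketch; it is not necessarily the order drawn in Figure 5.3.

```python
from collections import Counter

# Functional units assumed for this sketch: two load ports, one adder,
# one multiplier, and one store port.
UNITS = {"load": 2, "add": 1, "mul": 1, "store": 1}


def unit_of(op):
    """Map a hypothetical operation label such as 'LOAD_W1' to its unit."""
    return op.split("_")[0].lower()


def assign_chimes(order, units=UNITS):
    """Greedily pack operations, in order, into chimes."""
    chimes = [[]]
    busy = Counter()
    for op in order:
        unit = unit_of(op)
        if busy[unit] == units[unit]:
            # The unit is fully occupied, so this operation starts a new chime.
            chimes.append([])
            busy = Counter()
        chimes[-1].append(op)
        busy[unit] += 1
    return chimes


# Operations in source order.
SOURCE_ORDER = [
    "LOAD_W1", "LOAD_X1", "ADD_RA",
    "LOAD_W2", "LOAD_X2", "ADD_RB",
    "LOAD_W3", "LOAD_X3", "ADD_RC",
    "LOAD_SX", "MUL_1", "MUL_2", "MUL_3", "STORE_FS",
]

# A hypothetical interleaving that starts the multiplications earlier.
INTERLEAVED_ORDER = [
    "LOAD_W1", "LOAD_X1", "ADD_RA",
    "LOAD_SX", "LOAD_W2", "MUL_1",
    "LOAD_X2", "LOAD_W3", "ADD_RB", "MUL_2",
    "LOAD_X3", "ADD_RC", "MUL_3", "STORE_FS",
]

print(len(assign_chimes(SOURCE_ORDER)))       # 6
print(len(assign_chimes(INTERLEAVED_ORDER)))  # 4
```

Under these assumptions the source order needs 6 chimes and the interleaved order needs 4, the same chime counts as the two orders discussed in the text.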

In addition to differing in their execution times, these orders also differ in the
minimum number of registers needed to execute the operations without spilling registers.
To demonstrate this, I must first explain how a compiler uses registers and then how to
determine the minimal register requirement of an execution order. A compiler uses a vector
register to store a vector of values produced by a vectorizable operation. If a compiler
assigns each result to a different register, too many registers will be used; fewer registers
could be used without increasing the execution time predicted by chime counting.


*My  thanks  to  James  E.  Smith  for  this  etymological  fact. 
      DO 40 I=1,N
         RA = W(I,1)+X(I,1)
         RB = W(I,2)+X(I,2)
         RC = W(I,3)+X(I,3)
         FS(I) = {{{SX(I)*RA}*RB}*RC}
   40 CONTINUE


      DO 40 I=1,N
         RA = W(I,1)+X(I,1)
         RB = W(I,2)+X(I,2)
         RC = W(I,3)+X(I,3)
         FS(I) = {SX(I)*RA} * {RB*RC}
   40 CONTINUE

Figure 5.2: Two Dependence Graphs for the Sample Loop

These dependence graphs, both of which represent the sample loop, differ in how they
combine the results of the corresponding source code. Although these graphs are equivalent
in theory, they are not exactly equivalent in practice because floating-point arithmetic does
not strictly obey associativity; instead, these graphs are considered equivalent within a small
numerical error. Because both graphs combine the results in different orders, they also specify
different partial orders and, hence, different sets of correct execution orders. For example,
there is only one order in which the three multiplications can be placed for the dependence
graph on the left, whereas *12 *14 *13 and *14 *12 *13 are the only orders for the dependence
graph on the right.

By taking as input the source code in Figure 5.1, which does not have parentheses
to explicitly group the multiplications, a compiler would produce the dependence graph on the
left, in accordance with the FORTRAN conventions, which specify that a series of operators of
the same class is grouped from left to right. On the other hand, an optimizing compiler may
produce the dependence graph on the right to increase the amount of parallelism available; two
multiplications can be executed in parallel using the dependence graph on the right, whereas all
three multiplications are executed sequentially using the dependence graph on the left. Although
more than one dependence graph can represent a loop, I use dependence graphs that most closely
follow the arithmetic conventions of the FORTRAN language.

[Diagram: two execution orders, each annotated with CHIME and #LIVE columns]

Figure  5.3:  Two  Execution  Orders  for  the  Example  Loop 

These  are  two  execution  orders  that  satisfy  the  partial  order  specified  by  the  dependence  graph 
in  Figure  5.1.  For  each  order,  I  have  shown  which  operations  execute  in  the  same  chime  when  the 
hardware  has  chaining,  two  load  ports,  one  adder,  one  multiplier,  and  one  store  port.  I  have  also 
listed  for  each  order,  the  number  of  values  that  are  live  in  each  chime.  The  lifetime  of  a  value  is 
indicated  by  the  dependence  arrow  that  connects  the  producer  and  the  consumer  of  that  value,  and 
the  number  of  live  values  in  each  chime  is  equal  to  the  number  of  dependence  arcs  that  appear  in, 
or  pass  through,  that  chime. 

To explain how, I consider the dependence graph of a single basic block. A value
is said to be live from the time of its production to its last use. In Figure 5.3, the lifetime of
a value is indicated by the dependence arrow that connects the producer and the consumer
of that value. Two values that are live at different times can be assigned to the same vector
register. For example, for both execution orders in Figure 5.3, the values produced by the
operations LOAD1 and LOAD^ can be stored in the same vector register. In contrast, two
values that are live at the same time, such as the values produced by the operations +2
and LOAD4 in Figure 5.3, must be stored in different vector registers to avoid generating
extra instructions that would transfer the two values between memory and a shared register.

Although the execution of register-spill code, if done at a judicious moment, may
not increase execution time, I assume, for the sake of simplifying this example, that it does.
The impact of register spilling on execution time and register usage is taken into account
later in this chapter in Section 5.4, the quantitative part of this study. To avoid generating
any code for spilling registers and thereby increasing execution time, all simultaneously live
values must be assigned to different registers. Hence, a compiler needs to use only a number
of registers that is equal to the maximum number of simultaneously live values, a number
which is called the critical register quantity by Eisenbeis, Jalby, and Lichnewsky [32]. This
is, in fact, the minimum number of registers that can be used without spilling registers.
For the purposes of this example, it is sufficient to know that this minimum is achievable.
An assignment algorithm that is able to match this minimal requirement is described in
Section 6.2.3 (on page 123 in Chapter 6, Bus Usage and Register Assignment).

Now that I have explained what the minimum number of registers needed to execute
an order is, I can show that the two orders in Figure 5.3 have different minimal
register requirements. To do this, I must first know what values are live at the same time;
these are the values that are used by operations executing in the same chime because
operations in the same chime execute simultaneously. Thus, counting the number of live values
in each chime reveals the maximum number of simultaneously live values. In Figure 5.3, the
number of live values in each chime is equal to the number of dependence arcs that appear
in, or pass through, that chime. Based on this method for determining minimal register
requirements, the execution order on the left requires 5 vector registers to avoid generating
spill code and that on the right requires 6 vector registers.
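
The arc-counting rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the register-assignment algorithm of Chapter 6: the operation names are hypothetical labels, the two chime assignments are hypothetical reconstructions (the source order, and an interleaved order invented for this sketch that need not match Figure 5.3), and a value is counted as live in every chime from the one that produces it through the one that last uses it.

```python
# Dependence graph of the sample loop (consumer -> producers); the names
# are hypothetical labels chosen for this sketch.
DEPS = {
    "LOAD_W1": [], "LOAD_X1": [], "ADD_RA": ["LOAD_W1", "LOAD_X1"],
    "LOAD_W2": [], "LOAD_X2": [], "ADD_RB": ["LOAD_W2", "LOAD_X2"],
    "LOAD_W3": [], "LOAD_X3": [], "ADD_RC": ["LOAD_W3", "LOAD_X3"],
    "LOAD_SX": [],
    "MUL_1": ["LOAD_SX", "ADD_RA"],
    "MUL_2": ["MUL_1", "ADD_RB"],
    "MUL_3": ["MUL_2", "ADD_RC"],
    "STORE_FS": ["MUL_3"],
}

# Source order packed into 6 chimes.
SIX_CHIMES = [
    ["LOAD_W1", "LOAD_X1", "ADD_RA"],
    ["LOAD_W2", "LOAD_X2", "ADD_RB"],
    ["LOAD_W3", "LOAD_X3", "ADD_RC"],
    ["LOAD_SX", "MUL_1"],
    ["MUL_2"],
    ["MUL_3", "STORE_FS"],
]

# Hypothetical interleaved order packed into 4 chimes.
FOUR_CHIMES = [
    ["LOAD_W1", "LOAD_X1", "ADD_RA"],
    ["LOAD_SX", "LOAD_W2", "MUL_1"],
    ["LOAD_X2", "LOAD_W3", "ADD_RB", "MUL_2"],
    ["LOAD_X3", "ADD_RC", "MUL_3", "STORE_FS"],
]


def live_values_per_chime(chimes, deps=DEPS):
    """Count, for each chime, the values that are live in that chime."""
    chime_of = {op: i for i, chime in enumerate(chimes) for op in chime}
    last_use = {}
    for consumer, producers in deps.items():
        for producer in producers:
            start = chime_of[producer]
            last_use[producer] = max(last_use.get(producer, start),
                                     chime_of[consumer])
    live = [0] * len(chimes)
    for producer, end in last_use.items():
        for chime in range(chime_of[producer], end + 1):
            live[chime] += 1
    return live


print(live_values_per_chime(SIX_CHIMES))        # [3, 4, 5, 5, 4, 3]
print(live_values_per_chime(FOUR_CHIMES))       # [3, 4, 6, 5]
print(max(live_values_per_chime(SIX_CHIMES)))   # 5
print(max(live_values_per_chime(FOUR_CHIMES)))  # 6
```

For these two assignments the maxima are 5 and 6, the same minimal register requirements as the two orders discussed in the text; the store produces no register value and is therefore never counted.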

In addition to differing in execution time and minimal register requirements, these
orders differ in two other respects. First, the execution order on the right exhibits more
parallelism; there are two chimes in which four operations are executed while at most three
operations are executed in a chime in the execution order on the left. This is because both
orders execute the same number of operations, but the one on the right executes in less
time. The second difference is that the execution order on the right uses the functional
units more effectively. In every chime, at least one load port is always used; in other words,
during the execution of this loop using the order on the right, a load operation is initiated
every clock period. In contrast, there is no single functional unit that is always in use when
the order on the left is executed.

In summary, Figure 5.3 demonstrates how using more registers allows more parallelism
to occur, which in turn reduces the time to execute a loop and results in more effective
use of the functional units. In this case, the Cray Y-MP provides enough vector registers to
accommodate the faster execution order. However, one example does not prove sufficiency
in general. One goal of this study is to determine a cost-effective number of registers, which
I believe is more than the 8 vector registers currently provided in the Cray Y-MP vector
processor.

5.1.2  Why  a  Different  Scheduling  Algorithm? 

To substantiate my hypothesis that more registers are needed to improve performance,
I must determine the minimal number of registers required for maximal performance
in each loop of the CRI workload. A minimal register requirement is associated with a
particular execution order of a dependence graph, as was explained in the previous
subsection, and an execution order is chosen by an instruction scheduler in a
compiler. Hence, the algorithm used by an instruction scheduler affects both the execution
time and register usage of a loop. What is not obvious is how much a scheduling algorithm
affects performance and register usage. In this subsection, I present two sets of apparently
contradictory data that together suggest that the performance impact of a scheduling
algorithm can be significant and that a scheduling algorithm different from the one used in the
1990 version of the cft77 compiler is needed to prove my hypothesis.

The first set of data indicates that execution time could possibly be reduced by a
significant amount. This data is based on a static lower bound for the per-iteration execution
time of a loop. An important characteristic of this lower bound is that it is calculated using
only the frequency of operational types in a dependence graph and the number and types
of functional units in the hardware. This provides a method for quantifying the maximal
improvement to performance without having to generate an actual execution order that
achieves this improvement. Because a loop can consist of one or more dependence graphs
(for example, vectorizable loops with conditional statements or scalar reductions), the static
lower bound for the per-iteration execution time of a loop is, in fact, based on a lower bound
for the per-iteration execution time of a dependence graph, and is equal to the sum of the
lower bounds for each of its dependence graphs. Short of actually executing a loop, there
is no information about the execution frequency of each dependence graph. Hence,
this lower bound for a loop’s execution time is a static one because it does not accurately
account for dynamic information and, instead, assumes that all dependence graphs in a loop
are executed the same number of times.

A lower bound on the execution time for a dependence graph is equal to the
number of times a critical resource is used, where a critical resource is a functional unit
that is used most frequently to execute operations in that dependence graph [lO?]. For
example, the load port is a critical resource for the dependence graph in Figure 5.1. The
following statements summarize the relationships that establish the lower bound for the
execution time of a dependence graph.

a lower bound for the per-iteration execution time of a dependence graph, G
    = the number of times a critical resource is used in G
    ≤ the number of chimes needed to execute G
    ≤ the number of clock periods needed to execute one iteration of G

The  number  of  times  a  critical  resource  is  used  is  a  lower  bound  on  the  execution  time  of  a 
dependence graph because it must be less than or equal to the number of chimes needed to
execute a dependence graph; otherwise, a critical resource would be used twice in the same
chime, which is impossible. In turn, the number of chimes needed to execute one iteration
of a dependence graph must be less than or equal to the number of clock periods because
the chime count only considers vectorizable operations, whereas the clock-period count also
includes loop- and strip-overheads.

Thus, determining the lower bound for the execution time of a dependence graph
is a simple matter of counting the different types of operations in such a graph, dividing
the frequency of each operational type by the number of functional units that execute that
operational type, and determining the maximum of these quotients. In other words, a lower
bound for the execution time of a dependence graph is equal to:

\[
\max_{T} \left( \frac{\text{the number of operations of type } T}
                     {\text{the number of functional units that execute the operational type } T} \right)
\]


For  example,  the  lower  bound  for  the  dependence  graph  in  Figure  5.1  is  four  chimes  when  the 
hardware  has  two  load-ports  and  a  store-port.  This  lower  bound  is  greatly  affected  by  the 
configuration  of  functional  units.  For  example,  if  there  were  only  one  memory-port  in  the 
hardware,  the  lower  bound  for  the  dependence  graph  in  Figure  5.1  would  be  eight  chimes. 
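
The calculation of the static lower bound can be sketched in Python as follows. This is an illustration only: the function names and the type labels ("load", "mem", and so on) are hypothetical, and the quotient is rounded up to a whole number of chimes, which is an assumption made here so that the 7 loads on 2 load ports give the four chimes stated in the text.

```python
import math
from collections import Counter


def graph_lower_bound(op_types, units):
    """Lower bound, in chimes, for one dependence graph: the largest
    ratio of operations of a type to functional units of that type."""
    counts = Counter(op_types)
    return max(math.ceil(n / units[t]) for t, n in counts.items())


def loop_lower_bound(graphs, units):
    """Static lower bound for a loop: the sum of the bounds of its
    dependence graphs, each assumed to execute equally often."""
    return sum(graph_lower_bound(g, units) for g in graphs)


# Operation types of the sample loop: 7 loads, 3 adds, 3 multiplies, 1 store.
SAMPLE = ["load"] * 7 + ["add"] * 3 + ["mul"] * 3 + ["store"]

# Two load ports and one store port.
TWO_LOAD_PORTS = {"load": 2, "add": 1, "mul": 1, "store": 1}

# A single memory port shared by loads and stores.
SAMPLE_ONE_PORT = ["mem" if t in ("load", "store") else t for t in SAMPLE]
ONE_MEMORY_PORT = {"mem": 1, "add": 1, "mul": 1}

print(graph_lower_bound(SAMPLE, TWO_LOAD_PORTS))            # 4
print(graph_lower_bound(SAMPLE_ONE_PORT, ONE_MEMORY_PORT))  # 8
print(loop_lower_bound([SAMPLE], TWO_LOAD_PORTS))           # 4
```

Because the bound depends only on operation counts and unit counts, it can be computed without generating any execution order, which is what makes it a static bound.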

A static lower bound on execution time gives a static upper bound on the improvement
in performance relative to that of the cft77 compiler using eight vector registers.
Figure 5.4, which summarizes this relative performance, shows that there is a possibility for
substantial improvement for almost all the loops and that the performance of the workload
can be improved by up to 37%. Although an upper bound on relative performance difference
should always be positive, there are three data points that show a negative difference.
This is because these loops do not conform to the assumption that all dependence graphs
in a loop are executed the same number of times. For these loops, a scalar reduction is
computed, and the vectorized version consists of two dependence graphs: one to compute
partial sums and the other to compute the final sum (for a full explanation of this
transformation see Section 2.3 on page 17 in Chapter 2, Fundamentals of Vector Architectures).
The number of times these dependence graphs are executed is different; the first is executed
several times, and the second is executed only once. Without this dynamic information, the
static lower bound places equal emphasis on both dependence graphs, thus over-estimating
the execution time of the loop and showing a paradoxical performance degradation. Despite
these three misleading data points, the overall data indicate that there is a possibility for
significant improvement in performance.

Unfortunately, the next set of data appears to contradict the optimistic promise
shown by the static upper bound. In addition to the obvious question of “Do execution
orders exist that can achieve this upper bound?”, the question that is more pertinent to this
study is “What is the minimum number of registers needed to achieve this upper bound?”
My theory hypothesizes that more than eight vector registers are needed. If this hypothesis
is true, then increasing the number of registers should, on average, reduce the execution
time. Because the cft77 compiler can be directed to use any number of vector registers, I
can easily test my hypothesis by comparing the performance using 8 vector registers with
the performance using 64 registers, the latter number of registers being sufficiently large
to avoid adverse effects on performance due to limited register capacity. This performance

Figure 5.4: Static Upper Bound Vs. Cft77 Scheduler Using 8 Registers

This graph shows the maximal improvement in performance over that of the 1990 version
of the cft77 scheduler using 8 vector registers, indicating that there is a possibility for significant
improvement in performance. Section 5.2.1 describes the performance metrics and the basic layout
of this graph, and Figures 5.18 and 5.19 list the execution times plotted in it.

[Graph; horizontal axis: Loops Sorted by Average Execution Time Per Iteration]

Figure 5.5: Cft77 Scheduler Using 64 Registers Vs. Cft77 Scheduler Using 8 Registers

To test my hypothesis that more than 8 vector registers are needed to improve performance, this
graph compares the performance of the 1990 version of the cft77 scheduler using 64 vector registers to
that of the same scheduler using 8 vector registers. Section 5.2.1 describes the performance metrics
and the basic layout of this graph, and Figures 5.18 and 5.19 list the execution times plotted in it.


data, gathered using the 1990 version of the cft77 compiler, is summarized in Figure 5.5.

Because more registers are provided, using 64 vector registers should always be
faster than using 8. However, a few loops execute slower with 64 registers, the worst case
being 5% slower. After hand-examining the assembly code for several of these loops, I
determined that the only difference between using 64 registers and using 8 for the same
loop is that the data is placed in different locations in memory, possibly causing conflicts to
occur in different memory banks. Because memory-bank conflicts can cause a 5% variance
in performance, I believe that relative differences less than 5%, including negative ones, are
due to differences in memory access patterns and should be considered insignificant.

Overall, the relative improvement to performance for the entire workload when
using 64 vector registers instead of 8 registers is about 9%, a reasonable gain in performance.
A great disappointment, however, is the meager distribution of the individual differences
in relative performance: only 5 out of the 36 loops show more than a 10% improvement in
performance, and the rest show less than a 5% performance improvement. Not enough loops
show a significant performance improvement to warrant increasing the number of registers
beyond eight. As a result, adding more registers appears not, on average, to reduce the
execution time significantly.

One possible explanation I considered for these apparently contradictory results is
that the upper bound is unrealistic, the cft77 scheduler already provides the best achievable
performance, and 8 vector registers are enough. In particular, the [...] does
not use any information about the structure of a dependence graph, which may actually
prevent the upper bound from being achieved. I did not believe this explanation because I
experimented with some of the loops by hand in a manner similar to that in the example
in the previous subsection and estimated that their execution time could be reduced by
adding more registers. Another possible explanation is that more than 64 registers are
needed to achieve the upper bound. I did not believe this second explanation either because
the execution orders I produced in my hand-experiments used fewer than 64 registers to
improve performance. A third possibility is that more sophisticated compiler optimizations
[...]. [...] improve performance, this does not preclude a different scheduling algorithm from
doing so. Hence, in addition to this study's primary goal of finding a cost-effective number
of vector registers, a secondary goal is to show that a different scheduling algorithm uses
more registers more effectively than does the cft77 scheduling algorithm. [...]
chapter I compare, both qualitatively and quantitatively, the impact the cft77 scheduler and
a different scheduler have on performance to show that this fourth explanation reconciles
the disparate results presented above.


5.2  Experimental  Framework 


In this section, I describe the performance criteria and the methodology I use to
carry out the studies throughout this chapter. Other aspects of the experimental framework
(the architectural platform, the performance tools, and the workload) are described
in Chapter 4, Common Experimental Framework.


5.2.1  Performance  Criteria 

As part of my investigation, I compare the performance of different combinations
of scheduling algorithms and number of vector registers. The raw data for these comparisons
are the times needed to execute each loop in the CRI workload using various combinations
of scheduling algorithms and number of registers. Unless stated otherwise, these
times are measured using the Cray Y-MP simulator. In this subsection, I describe how
this raw data is used to determine whether the performance of two configurations differs
significantly. First, I describe the performance metrics used to represent the time it takes to
execute a loop and the time it takes to execute the entire workload when using a particular
configuration. I then describe how I compare the performances among various configurations
and give the criteria for acceptable performance improvement.

As the performance metric for a loop, I use the average time to execute one
iteration of that loop:

$$ t = \frac{\text{the time to execute the entire loop}}{\text{the number of iterations executed for the loop}} = \frac{T}{L} $$

An obvious alternative to using per-iteration time as a performance metric is
the time it takes to execute the entire loop. Loop-execution time is influenced by factors
arising from a program and a compiler, whereas per-iteration time is more influenced by
just the compiler because the number of iterations executed for a loop, which is influenced
by a program, is factored out. This makes per-iteration time the preferred metric because it
emphasizes the differences in performance among various execution orders, which is what
I want to measure. Moreover, just as loop-execution time includes the times to execute
loop and strip overheads, per-iteration time also includes these overheads because it is
calculated from loop-execution time. Although using per-iteration time assumes that all
loops in a workload are executed the same number of times, I address this assumption below
when I discuss the criteria for evaluating acceptable improvements to performance.

Another alternative to per-iteration time is the MFLOPS^3 rate of a loop, which is
a standard metric for measuring the performance of scientific workloads [55, ...]. Because
the MFLOPS rate r of a loop is computed from the same data as the per-iteration time t,
these two metrics are, in fact, inversely proportional to each other. Just as t is the average
time it takes to execute one iteration of a loop, 1/r can be interpreted as the average time
it takes to execute one floating-point operation of that loop. In other words, t is
the product of 1/r and f, the number of floating-point operations executed in one iteration
of that loop. This equality can be shown algebraically as follows:

$$ t = \frac{T}{L} = \frac{T}{L \times f} \times f = \frac{1}{r} \times f $$

I  use  per-iteration  time  instead  because  not  all  the  loops  in  the  CRI  workload  contain 
floating-point  operations.  Furthermore,  using  per-iteration  times  shifts  the  emphasis  from 
floating-point  operations  to  execution  time,  which  is  more  directly  related  to  performance. 
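
The relation between the two metrics can be checked numerically. The following is a minimal Python sketch, not part of the original study: the function names and the sample values of T, L, and f are hypothetical, and the rate is expressed in floating-point operations per clock period rather than in true MFLOPS units. It also shows why a rate-based metric is unusable for a loop with no floating-point operations.

```python
from typing import Optional


def per_iteration_time(total_time: float, iterations: int) -> float:
    """t = T / L: average time to execute one iteration of a loop."""
    return total_time / iterations


def flop_rate(total_time: float, iterations: int, flops_per_iter: int) -> Optional[float]:
    """r = (L * f) / T, in floating-point operations per unit of time.

    Returns None when the loop contains no floating-point operations,
    because a rate-based metric says nothing about such a loop.
    """
    if flops_per_iter == 0:
        return None
    return iterations * flops_per_iter / total_time


# Hypothetical loop: T = 6400 clock periods, L = 64 iterations, f = 5.
T, L, f = 6400.0, 64, 5
t = per_iteration_time(T, L)
r = flop_rate(T, L, f)

# t should equal (1 / r) * f, as in the algebraic identity above.
identity_holds = abs(t - (1.0 / r) * f) < 1e-9
```

Per-iteration time is defined for every loop, whereas `flop_rate` has to return `None` for a loop without floating-point operations.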

As the performance metric for the entire workload, I use the sum of per-iteration
times of all loops in the workload. Just as per-iteration time is proportional to the inverse
of MFLOPS rate, this sum is proportional to the inverse of the weighted harmonic mean
of MFLOPS rates of each loop, where the weight for a loop is the number of floating-point
operations executed in one iteration of that loop. This equality can be shown algebraically
as follows:

$$
\sum_{i=1}^{N} t_i
= \sum_{i=1}^{N} \frac{T_i}{L_i}
= \sum_{i=1}^{N} \frac{1}{r_i} \times f_i
= N \times \frac{\sum_{i=1}^{N} \frac{1}{r_i} \times f_i}{N}
= N \times \frac{1}{\;\dfrac{N}{\sum_{i=1}^{N} \frac{1}{r_i} \times f_i}\;}
= N \times \frac{1}{\text{weighted harmonic mean of MFLOPS rates}}
$$

where N is the number of loops in the workload
T_i is the time to execute the ith loop
L_i is the number of iterations executed for the ith loop
t_i is the per-iteration time for the ith loop
r_i is the MFLOPS rate for the ith loop
f_i is the number of floating-point operations
executed in one iteration of the ith loop

Although the harmonic mean is a standard metric for summarizing the performance of a
workload, I use the sum of per-iteration times for the same reasons I cited in the previous
paragraph.

^3 MFLOPS is an acronym for "Millions of FLoating-point Operations Per Second."

A critical aspect of my investigation is to determine which of two configurations
is faster. To do this, I examine the relative difference in performance of both individual
loops and the entire workload. For an individual loop, I use the performance metric for
a loop to calculate the difference in performance of a new configuration relative to a base
configuration:

$$ \frac{t_i^{\text{base}}}{t_i^{\text{new}}} - 1 $$

where $t_i^{\text{base}}$ is the per-iteration time for the ith loop using the base configuration, and
$t_i^{\text{new}}$ is the per-iteration time for the loop using the new, and presumably faster, configuration.
For a summary of the individual relative differences, I use the performance metric for a
workload to calculate the relative difference in performance of the entire workload:

$$ \frac{\sum_{i=1}^{N} t_i^{\text{base}}}{\sum_{i=1}^{N} t_i^{\text{new}}} - 1 $$
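
As an illustration only, the two relative differences can be computed as in the sketch below. It assumes that the relative difference is the base time divided by the new time, minus one, which agrees with the criteria stated later in this subsection; the function names and the three sample loops are hypothetical.

```python
from typing import Sequence


def loop_relative_difference(t_base: float, t_new: float) -> float:
    """Relative difference for one loop: t_base / t_new - 1."""
    return t_base / t_new - 1.0


def workload_relative_difference(base: Sequence[float], new: Sequence[float]) -> float:
    """Relative difference for a workload, using sums of per-iteration times."""
    return sum(base) / sum(new) - 1.0


# Hypothetical per-iteration times, in clock periods, for three loops.
base_times = [10.0, 40.0, 200.0]
new_times = [8.0, 40.0, 160.0]

per_loop = [loop_relative_difference(b, n) for b, n in zip(base_times, new_times)]
overall = workload_relative_difference(base_times, new_times)
```

A positive value means the new configuration is faster than the base configuration for that loop or workload.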


Although I examine many different configurations, I directly compare only two
at a time by plotting their relative performance differences in a graph that has a specific
structure. Not only does such a graph provide a visual way to compare the performances of
two configurations, but multiple graphs with this common structure allow the performances
of several configurations to be compared simultaneously. Figure 5.5 on page 82 (in the
previous section) shows an example of such a graph. Loops are plotted along the X-axis,
sorted by execution time. In other words, the leftmost loop executes in the fewest clock
periods per iteration when using the cft77 scheduler and 8 vector registers, and the rightmost
loop executes in the greatest number of clock periods. The difference in performance of a
new configuration relative to that of a base configuration is plotted along the Y-axis. The
new configuration is listed first in the caption title (for example, "Cft77 Scheduler Using
64 Registers" in Figure 5.5), and the base configuration is listed last. Positive values for the
relative difference [...]. [...] plotted using one of three symbols:

○   10% < relative difference

×   -5% < relative difference ≤ 10%

□   relative difference ≤ -5%

A dashed horizontal line gives the relative difference in performance for the entire workload.

Implementing the new configuration is justified only if both of the following criteria
are satisfied:

1. the improvement in performance for the entire workload is greater than 10%, and

2. the majority of individual loops in the workload show a performance improvement
greater than 10%.

In more mathematical terms, implementing the new configuration is justified only if:

1. $\sum_{i} t_i^{\text{base}} / \sum_{i} t_i^{\text{new}} - 1$ is greater than 10%, and

2. the median of the values $t_i^{\text{base}} / t_i^{\text{new}} - 1$ is greater than 10%.

Any performance improvement of [...] hardware is 10%. Hence, any improvement in
performance must be at least 10% to justify the [...].

[...] different reason. The first criterion avoids the negative effects of Amdahl's Law. In
other words, large improvements in individual [...]. [...] criterion, I present two examples
using hypothetical sets of performance data for the CRI workload (all execution times are
rounded to the nearest 10 clock periods):
The base configuration, which is the cft77 compiler using 8 vector registers,
executes the CRI workload in 1270 clock periods.

For the first example, suppose that a new configuration improves the performance
of the 20 loops with the shortest per-iteration times by 45% but has no
effect on the performance of the other loops. Then, the median performance
improvement for this example is 45%. However, because the sum of per-iteration
times of the 20 loops is 240 clock periods when using the base configuration,
the performance improvement for the workload is only

$$ \frac{1270}{1270 - 240 + \frac{240}{1.45}} - 1 \approx 6\%. $$

For the second example, suppose that a different configuration improves the
performance of the 2 loops with the longest per-iteration times by 45% but has no
effect on the performance of the other loops. Because the sum of the per-iteration
times of these 2 loops is 440 clock periods when using the base configuration, the
performance improvement for the workload is

$$ \frac{1270}{1270 - 440 + \frac{440}{1.45}} - 1 \approx 12\%. $$

However, because the remaining 34 loops show no improvement in performance,
the median performance improvement is 0%.


These two hypothetical examples show a huge variation in performance improvement among
individual loops. In both cases, a select subset of loops shows a significant performance
improvement of 45% whereas the rest of the loops show none. In the first example, this
select subset is a significant portion of the loops in the workload but does not represent a
significant enough portion of the time to execute it. The select subset in the second example
represents a significant portion of the time to execute the workload but is not a significant
portion of the loops within it. An actual example of this situation is illustrated in Figure 5.5,
which shows the performance of the cft77 scheduler using 64 registers relative to that of the
same scheduler using 8. In both cases, because only one criterion is satisfied, implementing
the new configuration is not worthwhile despite large performance improvements among
individual loops.
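
The two hypothetical examples can be replayed with a short script. This is a sketch under stated assumptions: the text gives only the sums of per-iteration times (240 clock periods for the 20 shortest loops, 440 for the 2 longest, 1270 in total), so the individual per-loop times below are invented to match those sums, and the function names are hypothetical.

```python
from statistics import median

# Invented per-iteration times (clock periods) for 36 loops, sorted ascending.
# Only the three sums are taken from the text: 240, 590 and 440 (total 1270).
base = [12.0] * 20 + [590.0 / 14] * 14 + [220.0] * 2


def speed_up(times, indices, improvement):
    """Return new times in which the chosen loops improve by `improvement`."""
    chosen = set(indices)
    return [t / (1.0 + improvement) if i in chosen else t
            for i, t in enumerate(times)]


def evaluate(base_times, new_times, threshold=0.10):
    """Return (workload improvement, median improvement, both criteria met)."""
    workload = sum(base_times) / sum(new_times) - 1.0
    med = median(b / n - 1.0 for b, n in zip(base_times, new_times))
    return workload, med, (workload > threshold and med > threshold)


# Example 1: the 20 shortest loops improve by 45%.
example1 = evaluate(base, speed_up(base, range(20), 0.45))
# Example 2: the 2 longest loops improve by 45%.
example2 = evaluate(base, speed_up(base, range(34, 36), 0.45))
```

Under these assumptions, the first example passes only the median criterion and the second passes only the workload criterion, so neither configuration would be judged worthwhile.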


5.2.2  Methodology 

I considered two methods for achieving the goals of this chapter. These two methods
offer a tradeoff between providing a definitive answer and being able to produce an
answer in a reasonable amount of time.

The  first  method  solves  the  following  optimality  problem:  determine  the  shortest 

time  to  execute  a  dependence  graph  using  some  fixed  number  of  registers.  By  comparing 
the  optimal  performances  using  a  different  number  of  registers,  I  can  then  definitively  say 
that  the  most  cost-effective  number  of  registers  is  the  one  for  which: 

1.  using  fewer  than  that  number  decreases  performance  significantly,  and 

2.  using  more  than  that  number  does  not  increase  performance  significantly. 

Unfortunately, the major disadvantage of this method is that this optimality problem is in
a class of precedence-constrained problems, which are known to be NP-hard for an arbitrary
dependence graph [46]. Such problems are so computationally intensive that obtaining an
answer for just one instance can require years of computational time; moreover, the existence
of methods that are less computationally intensive is currently thought to be unlikely. This
optimality problem, however, is no longer NP-hard when the dependence graph is a tree,
which is a specially-structured dependence graph with no common subexpressions and where
each operation has only one dependence; the dependence graphs illustrated in this chapter
are trees (in Figures 5.2, 5.12, and 5.13). In fact, there are algorithms that take advantage
of a tree's regular structure to generate a minimal execution-time order in polynomial time
[87, 98, 99, 3]. But because 75% of the dependence graphs in the CRI workload are not
trees, this problem is NP-hard for the majority of this workload and, in particular, for the
larger dependence graphs.

An obvious method for finding an optimal order for such an NP-hard problem is
to examine all possible orders for a dependence graph. The time it would take to compare
all possible orders of a dependence graph with N operations is proportional to N!, which
is a function that grows more than exponentially with a constant increase in the number
of operations in a dependence graph. Despite this daunting super-exponential growth rate,
today's computers can exhaustively compare the orders of a dependence graph if the number
of operations is "small" enough. Moreover, characteristics of the problem can be used to
reduce the number of orders that are examined. For example, because not all orders satisfy
the partial order specified by a dependence graph, we can ignore any order whose prefix does
not satisfy the partial order, such as orders beginning with the operation STORE10 for the
dependence graph in Figure 5.1 (on page 74 of the previous section). Other rules based on
execution time and register usage can also be used to further prune the number of examined
orders. If these pruning rules allow most of the dependence graphs in the CRI workload to
be exhaustively compared in a reasonable amount of time, then this method could still be
used to provide a definitive answer.
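
The effect of the partial-order pruning rule can be illustrated with a small sketch. The code below is not the search program used in this study; the function name and the six-operation dependence graph are hypothetical. It counts only the orders in which every prefix respects the dependences, and compares that count with N!.

```python
import math
from typing import Dict, Set


def count_valid_orders(preds: Dict[str, Set[str]]) -> int:
    """Count execution orders that respect every dependence.

    A prefix is extended only with operations whose predecessors are all
    already in the prefix, so orders with an invalid prefix are never built.
    """
    ops = list(preds)

    def extend(done: Set[str]) -> int:
        if len(done) == len(ops):
            return 1
        total = 0
        for op in ops:
            if op not in done and preds[op] <= done:
                total += extend(done | {op})
        return total

    return extend(set())


# Hypothetical graph: two loads feed an add, the add and a third load feed
# a multiply, and the multiply feeds a store.
graph = {
    "LOAD1": set(), "LOAD2": set(), "LOAD3": set(),
    "ADD": {"LOAD1", "LOAD2"},
    "MUL": {"ADD", "LOAD3"},
    "STORE": {"MUL"},
}

valid = count_valid_orders(graph)          # orders satisfying the partial order
unpruned = math.factorial(len(graph))      # all N! permutations
```

For this small graph only 8 of the 720 permutations survive, but the number of surviving orders still grows very quickly with the size of the graph.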

To determine whether such a method is feasible, I timed how long a Sun SPARCstation 1
takes to exhaustively compare the orders of progressively larger and larger dependence
graphs. Figure 5.6 shows the results of these timings. Optimal solutions for dependence
graphs with fewer than 30 operations can be found in less than a minute. Unfortunately,
the time it takes to find an optimal solution grows extremely quickly even when pruning
rules are used. Based on this data, Figure 5.7 lists estimates for the comparison times of
the larger dependence graphs in the CRI workload. Even computers in the near future are
unlikely to improve this situation substantially because their performance is progressing by
only a factor of at most two every two years, whereas an increase of just two operations in
a dependence graph requires that search time be increased by a factor of 2.4.
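
The two growth rates quoted above can be compared directly. The following sketch uses only those two figures (a factor of 2.4 in search time per two additional operations, and at most a factor of two in hardware performance every two years); the function names are hypothetical, and the calculation assumes both rates stay constant.

```python
import math

SEARCH_GROWTH_PER_TWO_OPS = 2.4      # from the text
HARDWARE_GROWTH_PER_TWO_YEARS = 2.0  # from the text (an upper bound)


def search_time_factor(extra_ops: int) -> float:
    """Factor by which search time grows when `extra_ops` operations are added."""
    return SEARCH_GROWTH_PER_TWO_OPS ** (extra_ops / 2)


def years_to_absorb(extra_ops: int) -> float:
    """Years of hardware improvement needed to offset `extra_ops` operations."""
    two_year_periods = math.log(search_time_factor(extra_ops),
                                HARDWARE_GROWTH_PER_TWO_YEARS)
    return 2.0 * two_year_periods


years_for_two_ops = years_to_absorb(2)
years_for_twenty_ops = years_to_absorb(20)
```

Under these assumptions each additional operation costs more than a year of hardware progress, so waiting for faster machines does not make the larger dependence graphs tractable.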

In  summary,  although  this  method  could  provide  a  definitive  value  for  the  most 
cost-effective  number  of  registers,  it  is  infeasible  because  finding  an  optimal  execution  order 
for  the  larger  dependence  graphs  in  the  CRI  workload  requires  years  of  computational  time. 
Moreover,  this  method  is  impractical  from  a  compiler  standpoint  because  the  time  it  takes 
to  find  an  optimal  order  is  far  greater  than  the  time  to  execute  the  resultant  code  for  the 
larger  dependence  graphs,  regardless  of  how  frequently  the  code  is  executed.  This  method 
could, however, be used if dependence graphs with more than 50 operations were excluded,
but  doing  so  is  unacceptable  not  only  because  36%  of  them  in  the  CRI  workload  contain 
Figure 5.6: Completion Times for Exhaustive Comparisons

This  graph  shows  how  long  a  Sun  SPARCstation  1  took  to  find  an  optimal  execution  order  for 
each  of  five  dependence  graphs  from  the  CRI  workload  by  exhaustively  comparing  all  the  execution 
orders  of  a  dependence  graph.  The  five  dependence  graphs  have  7,  15,  18,  26  and  49  operations, 
respectively.  The  dashed  line  shows  that  the  comparison  time  grows  exponentially  relative  to  a 
constant  increase  in  the  operations  in  a  dependence  graph. 
Figure 5.7: Estimated Times for Exhaustive Comparisons

This  table  lists  estimated  times  for  finding  an  optimal  execution  order  for  some  of  the  larger 
dependence  graphs  in  the  CRI  workload.  The  dashed  line  in  Figure  5.6  is  used  to  calculate  these 
estimates,  which  are  rounded  to  the  nearest  order  of  magnitude. 
more than 50 operations, but also because I expect these graphs to be the ones with larger
register [...].

[...] algorithms that execute in polynomial time and then to compare the [...] different
number of registers. In addition to [...] performance, [...] to this method is that an
algorithm that results in a [...] can be adopted by a compiler with only a [...]. The main
disadvantage to this method is that it [...] heuristics and, although heuristics are
designed to minimize some aspect [...], [...] a scheduling algorithm chooses an [...] cannot
be guaranteed. Nonetheless, because I am beginning with [...] that already exists, namely,
the Cray Y-MP with 8 vector registers and the [...], I can at least show that a [...] exists
[...] when using more registers. The fact that such an [...] exists is not necessarily obvious,
as [...] previous section. Because this [...] to achieve the goals of my study.

5.3  A  Comparison  of  Two  Scheduling  Algorithms 

In this section, I describe and contrast the two scheduling algorithms [...]. [...] the
differences of the scheduling algorithms and provide a detailed [...] of [...] operations to be
executed. [...] scheduling vectorizable operations is comparable to scheduling them for [...]
well as deeply pipelined operations [52]. In [...] initiation [...] parallel [34, 76]. [...]
execute in unit time, and an abstract machine model is used to handle deeply-pipelined and
delayed operations. Similarly, a vector scheduler is concerned with grouping vectorizable
operations that can execute in parallel, which in turn causes individual ones to initiate in
parallel.

A major difference between schedulers for these architectures and one for a vector
architecture is that the unit of time for the latter is one chime rather than one clock
period. In other words, a VLIW scheduler groups operations into a VLIW instruction, which
executes in one clock period. A vector scheduler, however, groups vectorizable operations
that can execute in one chime. But rather than actually grouping these operations into one
instruction, a vector scheduler merely specifies the order in which these operations are to
execute to produce the parallelism found by the scheduler.

To facilitate scheduling vectorizable operations, I use a chime table to keep track of
when operations are scheduled to execute on which functional unit. A comparable table is
also used in schedulers for VLIW architectures, and could be called generically a functional
unit reservation table. In the context of vector scheduling, a chime table is a matrix where
each row represents one chime, each column represents a functional unit in the hardware,
and the entry in row i and column j represents an operation that is to be executed in the
ith chime by the jth functional unit. To generate an order, a scheduling algorithm places
operations into a chime table according to a set of rules that vary from scheduler to
scheduler. Once all operations are placed into a chime table, they are removed from the
table in chime order, so that their dependences are still preserved. The order of removal is
the execution order generated by a scheduling algorithm.
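
A chime table can be sketched as a small data structure. The class below is a minimal illustration rather than the implementation used in this study: the class name, method names, and functional-unit names are hypothetical, and operations within one chime are emitted in column order, which is an assumption (a real scheduler would have to order chained operations within a chime by their dependences).

```python
from typing import Dict, List, Optional


class ChimeTable:
    """Rows are chimes, columns are functional units, entries are operations."""

    def __init__(self, units: List[str]) -> None:
        self.units = units
        self.rows: List[Dict[str, Optional[str]]] = []

    def place(self, chime: int, unit: str, op: str) -> None:
        """Reserve `unit` in `chime` for `op`, growing the table on demand."""
        while len(self.rows) <= chime:
            self.rows.append({u: None for u in self.units})
        if self.rows[chime][unit] is not None:
            raise ValueError(f"{unit} is already busy in chime {chime}")
        self.rows[chime][unit] = op

    def execution_order(self) -> List[str]:
        """Remove operations in chime order; within a chime, in column order."""
        return [row[u] for row in self.rows for u in self.units
                if row[u] is not None]


# Hypothetical functional units and operations.
table = ChimeTable(["load", "add", "multiply", "store"])
table.place(0, "load", "LOAD a")
table.place(1, "load", "LOAD b")
table.place(1, "add", "ADD a,b")
table.place(2, "store", "STORE")
order = table.execution_order()
```

Placement rules differ between schedulers, but reading the table back row by row is the same for all of them.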

As with the examples in Figure 5.3 (on page 77), an estimate of the execution
time is given by the number of rows (or chimes) with at least one scheduled operation, and
an estimate of the minimal register requirements is given by the maximum number of live
values in a chime. These are estimates only because counting by chimes ignores both the
latencies of deeply-pipelined operations and the execution of any scalar operations, both of
which are part of the loop and strip overheads. Although this omission simplifies a vector
scheduling algorithm, these estimates are not unduly inaccurate. This is because in any one
chime, the latency of only one operation is actually exposed even though many operations
are executed, and an operational latency typically lasts only a fraction of a chime.
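
Both estimates can be computed from a schedule, as the sketch below suggests. It assumes a particular definition of liveness that the text does not spell out: a value is taken to be live from the chime in which it is produced through the last chime in which one of its consumers is scheduled. The function name and the sample schedule are hypothetical.

```python
from typing import Dict, List, Tuple


def estimate(chime_of: Dict[str, int],
             consumers: Dict[str, List[str]]) -> Tuple[int, int]:
    """Return (chimes used, maximum number of live values in any chime)."""
    chimes_used = len(set(chime_of.values()))
    live_counts: Dict[int, int] = {}
    for producer, users in consumers.items():
        if not users:                 # e.g. a STORE produces no register value
            continue
        first = chime_of[producer]
        last = max(chime_of[u] for u in users)
        for c in range(first, last + 1):
            live_counts[c] = live_counts.get(c, 0) + 1
    return chimes_used, max(live_counts.values(), default=0)


# Hypothetical schedule: two loads feed an add, which feeds a store.
schedule = {"LOAD1": 0, "LOAD2": 1, "ADD": 1, "STORE": 2}
uses = {"LOAD1": ["ADD"], "LOAD2": ["ADD"], "ADD": ["STORE"], "STORE": []}
chimes, registers = estimate(schedule, uses)
```

Here the schedule occupies 3 chimes, and chime 1 holds 3 live values (both loaded operands and the sum), so at least 3 vector registers would be needed under this definition.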

The type of scheduling algorithms I consider are called simple vector schedulers
[107], ones that schedule operations from the same iteration only. Other algorithms, such
as polycyclic scheduling,^4 trace scheduling, and loop unrolling, schedule operations from
different iterations to increase the amount of parallelism that occurs. For example, the
Cray-2 uses polycyclic vector scheduling to increase the amount of parallelism to compensate
for the lack of chaining hardware, which prevents flow-dependent operations from executing
in parallel [108, 32, 26]. In my study, I consider only simple vector schedulers because, as
the quantitative results of the next section show, significant improvement to performance
is still possible without having to resort to more complex scheduling algorithms.

The main difference between the two scheduling algorithms I use is how operations
are placed into a chime table. When examining these placement rules, there are three issues
to consider:

1. the goal of the placement rules,

2. the order in which operations are processed, and

3. the strategy for finding a time slot in which an appropriate functional unit and
register(s) are available.

^4 Polycyclic scheduling is also known as software pipelining [76] or overlapped loop scheduling [2&].

Operations are first sorted [...]; these priorities are [...] the goals of the scheduling
algorithm, [...] execution time or minimize register [...]. [...] increase register usage and
vice versa (as the examples in Figure 5.3 demonstrate), these cannot be goals of equal [...].
[...] are different for the two scheduling algorithms I use. [...] placing operations into a
chime table [...].

Details about each algorithm are given in Figures 5.8 and 5.9. These figures explain
how the priorities and strategy exploit [...] and secondary goals of the algorithm. Figure 5.8
[...] scheduling algorithm used [...] describes the algorithm that I [...] scheduling, which is
used for V[...], but the strategy is always the same: [...] when I hand-scheduled several [...].

The major differences between the cft77 and list schedulers are summarized in
Figure 5.10. The differences in order and strategy between these two schedulers directly
reflect the differences in their goals. The cft77 scheduler emphasizes register usage over
execution time because it must [...] architecture with only eight vector registers.
Nonetheless, the number [...] provided in hardware is an input parameter to this [...]
execution time in an attempt to use fewer registers. In contrast, the list scheduler I
developed does not consider limiting register usage to [...] because I am [...] imposed by
limited register capacity. Although the final step is designed to reduce register usage, this
step [...] scheduler to produce orders that execute in [...] description does not indicate
how much better the list scheduler is, nor does it indicate how often the list scheduler

[...] what has already been scheduled [48].


[...] indicate that all the functional units are being used in [...] of the chime
table so that operations without any predecessors (typically LOADs) can be scheduled
properly.

Given a dependence graph, schedule each operation in order of

1. appearance by statement in the source code, and

2. within a statement, decreasing maximal path distance from an operation with
no successors (typically a STORE)

by choosing the first chime c from the end of the chime table for which one of the
following is true:

1. the number of live values in chime c - 1 is equal to the number of vector registers
in the hardware,

2. a predecessor has been scheduled in chime c,

3. all appropriate functional units are being used in chime c - 1, or

4. an operand to the operation is being used by an operation already scheduled in
chime c - 1.


Figure 5.8: Cft77 Scheduling Algorithm

This figure describes how the scheduling algorithm used [...] places operations of [...];
its secondary goal is to minimize execution time. As an example, [...] this algorithm
produces the execution order on the left in Figure 5.3 for the dependence graph in Figure 5.1.

The first enumerated [...] to operations to determine the [...]. [...] the same value is
necessarily sequential in the Cray [...]. The fourth item shows this sequential execution
from the viewpoint of one of the successors [...].
Given a dependence graph, schedule each operation in order of

1. decreasing maximal path distance from an operation with no successors (typically
a STORE), and

2. for operations with the same maximal path distance, decreasing number of
successors

by choosing the first chime c such that

1. all ancestors have been scheduled before or in chime c,

2. an appropriate functional unit is available in chime c, and

3. operands to the operation are not being used by an operation already scheduled
in chime c.

Reschedule operations without any predecessors (typically LOADs) by choosing the
latest chime c such that

1. all successors have been scheduled after or in chime c, and

2. an appropriate functional unit is available in chime c.


Figure  5.9:  List  Scheduling  Algorithm 

[...] into a chime table. [...] this algorithm produces the execution order on the right in
Figure 5.3 for the dependence graph in [...].

[...] enumerated list [...] to operations that [...]. [...] higher priority is given to an
operation whose value [...].

[...] enumerated [...] table. The goal of the first of these lists [...]; the explanation for its
inclusion is the same as the one given [...] in Figure 5.8. The goal of the last list is to
reduce register usage. This is done by placing an operation as [...] operations that use its
value. Only operations [...] the most flexibility when placed in a [...] execution time of the
resultant [...].

An algorithm similar to this one has been implemented in a version of the cft77 compiler that
[...] the one used for my studies. In addition to the above, the newer [...]
also takes into account register usage to avoid generating an excessive number of register spills.
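
The list scheduling rules of Figure 5.9 can be sketched in code. This is an illustrative reading of the figure, not the scheduler used in this study, and it rests on several assumptions: each operation needs exactly one named functional unit, there is one unit of each type, ties that remain after both priorities are broken by operation name, the "all ancestors" rule is checked through direct predecessors (sufficient because predecessors are always processed first), and no register limit is modeled. All names are hypothetical.

```python
from typing import Dict, Set, Tuple


def list_schedule(preds: Dict[str, Set[str]],
                  unit_of: Dict[str, str]) -> Dict[str, int]:
    """Return a chime number for every operation in the dependence graph."""
    succs: Dict[str, Set[str]] = {op: set() for op in preds}
    for op, ps in preds.items():
        for p in ps:
            succs[p].add(op)

    dist: Dict[str, int] = {}

    def distance(op: str) -> int:   # longest path to an op with no successors
        if op not in dist:
            dist[op] = 0 if not succs[op] else 1 + max(distance(s) for s in succs[op])
        return dist[op]

    # Priority: decreasing path distance, then decreasing number of successors.
    order = sorted(preds, key=lambda op: (-distance(op), -len(succs[op]), op))
    chime_of: Dict[str, int] = {}
    busy: Set[Tuple[int, str]] = set()          # (chime, functional unit)

    def operand_in_use(op: str, c: int) -> bool:
        # True if an operation already in chime c reads one of op's operands.
        return any(chime_of[o] == c and preds[o] & preds[op] for o in chime_of)

    for op in order:                            # forward placement pass
        c = max((chime_of[p] for p in preds[op]), default=0)
        while (c, unit_of[op]) in busy or operand_in_use(op, c):
            c += 1
        chime_of[op] = c
        busy.add((c, unit_of[op]))

    for op in order:                            # move LOADs as late as possible
        if preds[op] or not succs[op]:
            continue
        latest = min(chime_of[s] for s in succs[op])
        for c in range(latest, chime_of[op], -1):
            if (c, unit_of[op]) not in busy:
                busy.discard((chime_of[op], unit_of[op]))
                busy.add((c, unit_of[op]))
                chime_of[op] = c
                break
    return chime_of


# Hypothetical dependence graph and functional units.
graph = {"LOAD1": set(), "LOAD2": set(), "LOAD3": set(),
         "ADD": {"LOAD1", "LOAD2"}, "MUL": {"ADD", "LOAD3"}, "STORE": {"MUL"}}
units = {"LOAD1": "load", "LOAD2": "load", "LOAD3": "load",
         "ADD": "add", "MUL": "multiply", "STORE": "store"}
result = list_schedule(graph, units)
```

For this graph the single load unit serializes the three loads, and the remaining operations are chained into the chimes of the loads that feed them, giving a three-chime schedule.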
             cft77 SCHEDULER                     LIST SCHEDULER

GOALS        1. minimize register usage          1. minimize execution time
             2. minimize execution time          2. minimize register usage

ORDER        as operations appear in the         based on properties of the
             source code                         dependence graph

STRATEGY     start at end of chime table         start at beginning of chime table
             and work backwards, taking          and work forwards
             into consideration the number
             of vector registers available in
             hardware

Figure 5.10: Comparison of Cft77 and List Scheduling Algorithms

This table summarizes the major differences between the cft77 and list scheduling algorithms.
Details about each algorithm are given in Figures 5.8 and 5.9, respectively.


is  better.  As  a  result,  quantitative  performance  data  is  needed  to  justify  changing  the 
algorithm  used  in  the  cft77  compiler. 

5.4  How  Many  Vector  Registers? 

Up until now, I have described qualitatively why more registers and a scheduling
algorithm that is different from the one used in the 1990 version of the cft77 compiler
are needed to effectively use the functional units in the Cray Y-MP. In this section, I
present data that not only substantiates these observations but, more importantly, shows
how many registers are needed and how much of an improvement to performance is possible
with more registers and a different scheduling algorithm. The primary goal of this study is
to find a cost-effective number of registers and the secondary goal is to show that a different
scheduling algorithm uses more than 8 vector registers better than the cft77 scheduler does.
The initial performance results presented in Section 5.1.2 suggest that the secondary goal
must be achieved before the primary one can be achieved. Consequently, I first present data
to show that the list-scheduling algorithm in Figure 5.9 uses more registers more effectively
than does the cft77 scheduler. I then present data by which to choose a cost-effective
number of registers, and then I discuss the impact of using this number of registers on the
performance of vectorizable loops and an entire program. Finally, I explain how the
data shows that larger loops are more likely to need more registers for greater performance
and that register spills have minimal impact on execution time. All the presentations of
data use the basic graph structure described in Section 5.2.1, and the execution times used
in these graphs are listed in Figures 5.18 and 5.19 at the end of this section.

From  the  performance  results  described  in  Section  5.1.2,  we  already  know  that  the 
cft77  scheduling  algorithm  described  in  Figure  5.8  (in  the  previous  section)  does  not  use 


Figure 5.11: List Scheduler Using 64 Registers Vs. Cft77 Scheduler Using 8 Registers

To show that a different scheduler can use more registers more effectively than the scheduler
used in the 1990 version of the cft77 compiler, this graph compares the performance of the list
scheduler using 64 vector registers to that of the cft77 scheduler using 8. Section 5.2.1
describes the performance metrics and the basic layout of this graph, and Figures 5.18 and 5.19
list the execution times plotted in it.


more than 8 vector registers effectively because little improvement to performance resulted
when using 64 registers, a number that is sufficiently large to avoid adverse performance
effects due to limited register capacity. Hence, the secondary goal of this study is easily
achieved by demonstrating that the list scheduling algorithm in Figure 5.9 of the previous
section can use 64 registers to significantly improve performance over that of the base
configuration. To generate the raw performance data, I replaced the scheduler in the cft77
compiler with the list scheduler. Figure 5.11 summarizes the improvements to performance
that result.

8 registers regardless of the scheduling algorithm. Yet, there are three loops that perform
more than 5% worse when the list scheduler is used rather than the cft77 one. These data points
emphasize the fact that scheduling algorithms rely on heuristics, which cannot guarantee
that the best execution order is generated for every dependence graph. For the two worst
cases, which are 13% and 10% slower, the order for scheduling operations causes the chime
table to be one chime longer than the corresponding one produced by the
cft77 scheduler. For the loop with the worst performance, Figure 5.12 illustrates how the
heuristic used by the list scheduler to choose the scheduling order just does not work,
whereas the heuristic used by the cft77 scheduler is ideal. For the loop with the second
worst performance, two operations are scheduled in the "wrong" order because they had
equal priorities for scheduling, and I schedule such operations in a random order. Figure 5.13
demonstrates that using another heuristic to schedule operations with equal priority would allow
list scheduling to generate an order that executes in a time comparable to one generated
by the cft77 scheduler.

Nevertheless, the list scheduler improves performance significantly for the entire workload
and for almost half the loops. These gains are all the more impressive because the execution
times include the time to execute loop and strip overheads, for which the heuristics in the
list scheduler do nothing.

Overall, performance for the entire workload improves when a different scheduler is used
with 64 vector registers instead of 8. In addition, 21 of
the loops improve their performance by more than 5%, with 17 of these achieving a performance
improvement greater than 10%. Although one of the performance criteria is not met
(the median performance difference is only 8%), the distribution of the performance
differences is significantly better than the distribution resulting from the cft77 scheduler
using 64 registers (shown in Figure 5.5), where performance improved by more than 10% for
only 5 of the 36 loops. Hence, Figures 5.5 and 5.11 show that the list-scheduling algorithm
uses more than 8 registers better than the cft77 scheduler does, and that, at the very least,
changing the scheduling algorithm in the cft77 compiler is warranted.

I now have the necessary tools with which to pursue the primary goal, which is
to determine a cost-effective number of vector registers. Although I have shown that a
significant improvement to performance is possible with 64 vector registers, fewer registers
or even just a change in the scheduling algorithm could produce comparable performance. In the
results in Figure 5.11 above, I compare two scheduler&register combinations where all the
components differ; some combination between these two extremes could perform as well as
the list scheduler using 64 registers. To better judge what a cost-effective combination is, I
compare, in successive order, several pairs of scheduler&register combinations. Figure 5.14
summarizes these comparisons, which progressively compare the relative performance of the
following scheduler&register combinations:

cft77  scheduler  and  8  vector  registers, 
list  scheduler  and  8  vector  registers, 
list  scheduler  and  16  vector  registers, 
list  scheduler  and  32  vector  registers,  and 
list  scheduler  and  64  vector  registers. 

The first graph, Figure 5.14a, compares the performances of the two scheduling
algorithms when both use 8 vector registers. This graph shows that the cft77 scheduler
does not always effectively use 8 registers; the list scheduler improves performance
for 7 loops. Hence, some of the improvement to performance seen in Figure 5.11
is attributable to a change of scheduling algorithm. Nevertheless, Figure 5.14a, in
combination with Figure 5.11, demonstrates that much of the improvement to performance is
attributable to both changing the algorithm and increasing the number of registers.
Figure 5.11 shows that using more registers significantly improves performance both overall
as well as for individual loops. Figure 5.14a shows that using only 8 registers with the


Cft77 SCHEDULER              LIST SCHEDULER

Scheduling Order             Scheduling Order
(statement number)           (maximal path distance)

LOAD2   (1)                  LOAD3   (3)
LOAD3   (1)                  LOAD8   (3)
×4      (1)                  LOAD13  (3)
+5      (1)                  LOAD2   (2)
STORE6  (1)                  ×4      (2)
LOAD7   (2)                  LOAD7   (2)
LOAD8   (2)                  ×9      (2)
×9      (2)                  LOAD12  (2)
+10     (2)                  ×14     (2)
STORE11 (2)                  +5      (1)
LOAD12  (3)                  +10     (1)
LOAD13  (3)                  +15     (1)
×14     (3)                  STORE6  (0)
+15     (3)                  STORE11 (0)
STORE16 (3)                  STORE16 (0)

Operations Placed Into Chime Table (Cft77 scheduler)

LOAD2   LOAD3   ×4   +5   STORE6
LOAD7   LOAD8   ×9   +10  STORE11
LOAD12  LOAD13  ×14  +15  STORE16

Operations Placed Into Chime Table (list scheduler)

LOAD3   LOAD8   ×4
LOAD13  LOAD2   ×9   +5   STORE6
LOAD7   LOAD12  ×14  +10  STORE11
                     +15  STORE16

Figure 5.12: Example Showing that List Scheduler Uses Wrong Heuristic

This figure demonstrates that the cft77 scheduler produces an execution order for the illustrated
dependence graph that is better than the execution order produced by the list scheduler because
the cft77 algorithm schedules operations in order of statement number rather than by maximal path
distance.


Cft77 SCHEDULER              LIST SCHEDULER

Scheduling Order             Scheduling Order
(maximal path distance)      (maximal path distance)

  LOAD4   (4)                  LOAD4   (4)
* LOAD3   (3)                * LOAD1   (3)
* LOAD1   (3)                * LOAD3   (3)
  ×5      (3)                  ×5      (3)
  +6      (2)                  +6      (2)
  +2      (2)                  +2      (2)
  ×7      (1)                  ×7      (1)
  STORE8  (0)                  STORE8  (0)

Operations Placed Into Chime Table (Cft77 scheduler)

LOAD4   LOAD3   ×5   +6
LOAD1           ×7   +2   STORE8

Operations Placed Into Chime Table (list scheduler)

LOAD4   LOAD1   ×5
LOAD3                +6
                ×7   +2   STORE8

Figure 5.13: Example Showing that List Scheduler Needs a New Heuristic

This figure demonstrates that another heuristic is needed to determine a scheduling order
among operations with equal priority. Because the illustrated dependence graph represents one
statement, the scheduling order used by both the cft77 and list schedulers is by maximal path
distance. The operations marked with an asterisk (*) have the same maximal path distance but are
scheduled in reverse order for each algorithm. As a result, the execution order produced by the cft77
scheduler is better than the one produced by the list scheduler.

Operations can be grouped into a chain of operations which can be executed in one chime using
chaining hardware despite RAW dependences. For example, the operations LOAD3, LOAD4, ×5, and
+6 form such a chain. The scheduling order used by the list scheduler, however, does not allow this
chain to form whereas the order used by the cft77 scheduler does. A new heuristic that gives higher
priority to an operation in a chain that is already partially scheduled would allow the list scheduler
to generate the better execution order.
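
The maximal-path-distance priority used in Figures 5.12 and 5.13 can be illustrated with a short sketch. It is not the implementation used in this study: the function names are hypothetical, and the edge list is an assumption inferred from the path distances listed in Figure 5.13 (LOAD4 feeds ×5, ×5 and LOAD3 feed +6, LOAD1 feeds +2, and +6 and +2 feed ×7, which feeds STORE8). The letter x stands in for the multiply sign.

```python
from collections import defaultdict

# Edges of the Figure 5.13 dependence graph, inferred from the listed path
# distances (an assumption; the drawing of the graph is not reproduced here).
FIG_5_13_EDGES = [
    ("LOAD4", "x5"), ("x5", "+6"), ("LOAD3", "+6"), ("+6", "x7"),
    ("LOAD1", "+2"), ("+2", "x7"), ("x7", "STORE8"),
]


def maximal_path_distances(edges):
    """Longest path length, counted in edges, from each operation to a sink."""
    successors = defaultdict(list)
    operations = []  # operations in order of first appearance
    for src, dst in edges:
        successors[src].append(dst)
        for op in (src, dst):
            if op not in operations:
                operations.append(op)

    memo = {}

    def distance(op):
        # A sink has no successors, so max() falls back to -1 and the result is 0.
        if op not in memo:
            memo[op] = 1 + max((distance(s) for s in successors[op]), default=-1)
        return memo[op]

    return {op: distance(op) for op in operations}


def scheduling_order(edges):
    """Operations sorted by decreasing maximal path distance (stable on ties)."""
    distances = maximal_path_distances(edges)
    return sorted(distances, key=lambda op: -distances[op])


if __name__ == "__main__":
    distances = maximal_path_distances(FIG_5_13_EDGES)
    for op in scheduling_order(FIG_5_13_EDGES):
        print(f"{op:7s} ({distances[op]})")
```

Because a predecessor always has a strictly larger distance than its successors, this order never places an operation before one it depends on. Ties are left in order of first appearance here, whereas the list scheduler in this study breaks them randomly; that tie-breaking choice is exactly what Figure 5.13 shows to matter.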
(a) List Scheduler, 8 Registers vs. Cft77 Scheduler, 8 Registers
(b) List Scheduler, 16 Registers vs. List Scheduler, 8 Registers
(c) List Scheduler, 32 Registers vs. List Scheduler, 16 Registers (avg = 8%)
(d) List Scheduler, 64 Registers vs. List Scheduler, 32 Registers (avg = 0%)

Vertical axis: Relative Difference
Horizontal axis: Loops Sorted by Average Execution Time Per Iteration

SYMBOL LEGEND
o    10% ≤ relative difference
x    -5% ≤ relative difference < 10%
□    relative difference < -5%

Figure 5.14: Performance Comparisons of Various Scheduler&Register Combinations

These graphs compare the performances of various scheduler&register combinations whose
results are used to choose the most cost-effective one. The loops in graphs (b) and (c) that show
a performance degradation of 8% and 9% (as indicated by the □'s) are unexpected because using
more registers should result in a relative difference in performance no worse than -5%. A closer
examination of the assembly code for these two loops revealed that the only difference between
using different numbers of registers is that the data is placed in different locations in memory,
possibly causing conflicts to occur in different memory banks. Although memory-bank conflicts
can cause up to a 5% variance in performance, these two data points indicate that sometimes the
variance can be greater. Section 5.2.1 describes the performance metrics and the basic layout of
these graphs, and Figures 5.18 and 5.19 list the execution times plotted in them.
list scheduler produces more than a 5% performance degradation for 10 loops, 8 of which
show a performance degradation of 10% or more. Because other loops show an
improvement, the relative difference in performance for the entire workload is small (-4%),
despite the significant degradation in performance of many individual loops. Thus,
Figures 5.11 and 5.14a together show that, as long as an appropriate scheduling algorithm is
used, more than 8 vector registers are needed to effectively use the functional units of the
Cray Y-MP.

The last three graphs in Figure 5.14 demonstrate how the performance of the list
scheduler is affected by the number of registers in the hardware. Increasing the number of
registers from 8 to 16 improves the performance of the workload by 14% and improves by
more than 10% the performance of 10 out of the 36 loops. Doubling the number of registers
from 16 to 32 results in a fair improvement (8%) in the workload performance but provides
more than 10% performance improvement for only 2 of the loops. Finally, using 64 registers
instead of 32 results in relatively little improvement to performance in either the workload
or individual loops. Because the greatest gain in performance is obtained by increasing the
number of registers from 8 to 16, I conclude that 16 vector registers is enough to effectively
use the functional units in the Cray Y-MP.

Figure 5.15, which is a complementary graph to those displayed in Figure 5.14,
shows the improvement in performance of the list scheduler using 16 vector registers relative
to the cft77 scheduler using 8. The graphs in Figure 5.14 show the relative change in
performance as scheduler&register combinations progressively change from the cft77 scheduler
using 8 registers to the list scheduler using 64. Whereas this illustrates which intermediate
combination achieves the greatest gain in performance, Figure 5.15 indicates the actual
improvement to performance over the cft77 scheduler using 8 registers. The overall
improvement (9%) of the list scheduler using 16 registers is not as high as that of the
same scheduler using 64 registers. This is because the largest loop, which represents a
substantial share of the execution time of the workload, can still benefit tremendously by
using more than 16 registers. However, the overall improvement of the list scheduler using
16 registers is the same as that of the cft77 scheduler using 64. Moreover, the number of
individual loops (15) showing significant improvement in the former combination is noticeably
higher than that (5) of the latter combination, and almost matches that (17) of the list
scheduler using 64 registers. Consequently, although the list scheduler using 16 registers falls
just short of the performance criteria for justifying more hardware, this scheduler&register
combination comes reasonably close and requires the least increase in hardware for the greatest
gain in performance.
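
The relation between the step-by-step comparisons of Figure 5.14 and the cumulative comparison of Figure 5.15 can be checked with a little arithmetic. The sketch below rests on two assumptions that are not taken from Section 5.2.1: that a workload-level relative difference d corresponds to a speedup of 1 + d, and that successive workload-level differences compose multiplicatively. The function names are hypothetical.

```python
def compose(relative_differences):
    """Cumulative relative difference of a chain of successive comparisons.

    Assumes each relative difference d corresponds to a speedup of 1 + d.
    """
    speedup = 1.0
    for d in relative_differences:
        speedup *= 1.0 + d
    return speedup - 1.0


def implied_step(cumulative, later_step):
    """Relative difference of the first step, given the cumulative difference
    over two steps and the relative difference of the second step."""
    return (1.0 + cumulative) / (1.0 + later_step) - 1.0


if __name__ == "__main__":
    # Values quoted in the text: 9% for the list scheduler with 16 registers
    # over the cft77 scheduler with 8, and 14% for going from 8 to 16
    # registers with the list scheduler.
    first_step = implied_step(0.09, 0.14)
    print(f"implied list/8 vs. cft77/8 difference: {first_step:+.1%}")
    print(f"recomposed cumulative difference:      {compose([first_step, 0.14]):+.1%}")
```

Under these assumptions the implied first step is a small negative number, which is in line with the small workload-level difference reported for Figure 5.14a. The reported percentages are rounded, so the check is only approximate.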

The improvements to performance reported in this study are only for the
vectorizable portions of a program. To determine the improvement in performance of an entire
program, we can use Amdahl's Law to calculate a program's speedup $S_k$ as a function of
both vector speedup $k$ and fraction of time spent executing vectorizable code $f$:

$$S_k = \frac{1}{(1 - f) + \frac{f}{k}}$$

Under the assumption that the CRI workload is representative of the vectorizable portion
of a program, I use the improvement in performance over the entire workload (1.09) as the



Figure 5.15: List Scheduler Using 16 Registers Vs. Cft77 Scheduler Using 8 Registers

To show the improvement to performance of a cost-effective number of registers, this graph
compares the performance of the list scheduler using 16 vector registers to that of the cft77 scheduler
using 8 vector registers. Section 5.2.1 describes the performance metrics and the basic layout of this
graph, and Figures 5.18 and 5.19 list the execution times plotted in it.
Figure 5.16: Performance Improvement of a Program

Using Amdahl's Law, this graph shows that, although the performance improvement of a
program declines with decreasing portions of vectorizable code, the decline in performance for a small
vector speedup is not as rapid as the decline for a large vector speedup. The curve labeled $S_{10}$ shows
the program speedup when vector speedup is 10, a speedup which typically results when using vector
hardware rather than scalar hardware. The curve labeled $S_{1.09}$ gives the program speedup when
vector speedup is 1.09, which is the average improvement in performance of the CRI workload when
using the list scheduler with 16 vector registers rather than the cft77 scheduler with 8 registers.

For a large vector speedup of 10, program speedup is inversely proportional to $1-f$, the amount
of non-vectorizable code, which results in a rapid decline in overall program speedup as the amount
of vectorizable code decreases. In contrast, for a small vector speedup of 1.09, program speedup is
linearly proportional to $f$, the amount of vectorizable code, which results in only a near-linear decline
in overall program speedup as $f$ decreases.


vector speedup. Figure 5.16 shows that the improvement in program performance when
using the list scheduler and 16 vector registers degrades only linearly as the amount of
vectorizable code in a program drops. Because a near-linear decline in program speedup
occurs for vector speedups of less than 2, this rate of decline would still occur even if the
vectorizable portion of a program consisted mainly of loops with the larger improvements.
This is in contrast to typical applications of Amdahl's Law that show a precipitous drop
in program speedup because such analyses use the much higher vector speedup of 10. As
a result, although the improvement indicated by the study in this chapter is small relative
to typical applications of Amdahl's Law, its impact on a program's overall improved
performance does not drop off as rapidly with decreasing amounts of vectorizable code.
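
The contrast between the two vector speedups can be made concrete with a minimal sketch of Amdahl's Law as used above, with f the fraction of time spent in vectorizable code and k the vector speedup. The function name and the sample fractions are chosen here for illustration only; they are not data from this study.

```python
def program_speedup(f, k):
    """Amdahl's Law: overall speedup when a fraction f of the execution time
    is sped up by a factor k and the remaining 1 - f is unchanged."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie between 0 and 1")
    if k <= 0.0:
        raise ValueError("k must be positive")
    return 1.0 / ((1.0 - f) + f / k)


if __name__ == "__main__":
    print("   f    S_10   S_1.09")
    for f in (1.0, 0.9, 0.7, 0.5, 0.3, 0.1):
        print(f"{f:4.1f}  {program_speedup(f, 10.0):6.3f}  {program_speedup(f, 1.09):6.3f}")
```

Printing the two columns side by side shows the behavior described above: with k = 10 the speedup falls sharply as soon as f drops below 1, whereas with k = 1.09 it shrinks gradually toward 1.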

Finally, when I explained the issues involved in this study, I made two observations
that my performance data also substantiates. First, in justifying the inclusion of large loops
in the CRI workload, I stated that these were the ones more likely to require more registers
for enhanced performance because these are more likely to have more operations that
execute in parallel. Figure 5.14b supports this intuition in that only the loops on the
right-hand side of the graph show any significant improvement when the number of registers is
doubled from 8 to 16. These loops are the larger ones because loops with more operations
will tend to execute longer, and longer-executing loops are plotted on the right-hand side
of the graphs.

Second, in explaining why more registers are needed, I stated that avoiding register
spills is not an adequate reason because, once enough registers are provided, execution of
register-spill code has little impact on the execution time. To demonstrate this, I reproduce
Figures 5.14b and 5.14c in Figure 5.17 and use boxes to mark loops whose minimal register
requirement is greater than the base configuration for that graph. Because I did not include
any mechanism for matching register usage with what the hardware provides, the same
execution order and, hence, the same minimal register requirement for a loop is produced
by the list scheduler, irrespective of the actual configuration of the register file. If more
registers are required than the hardware can provide, the cft77 compiler will generate extra
instructions to spill registers. For example, as Figure 5.17 illustrates, register-spill code is
generated for 17 loops when only 8 vector registers are provided in hardware but for only
5 loops when 16 registers are available. Execution of this extra code could not be effectively
hidden when using only 8 registers, as indicated by the drastic improvement in performance
of several loops when the number of registers is doubled to 16. In contrast, execution
of register-spill code has little impact on performance for 3 of the 5 loops when using
16 registers, as indicated by the lack of improvement in performance for these loops when
the number of registers is doubled to 32. Hence, once enough registers are provided, register
spills can be accommodated with little impact on performance, and the better reason for
adding more registers is to allow more aggressive scheduling so that more parallelism occurs.

Figures 5.18 and 5.19 give the per-iteration execution times and minimal register
requirements for each loop. These data are used in the various graphs I presented in this
section.


5.5  Related  Work 

Three other groups of researchers have investigated scheduling algorithms for
vector architectures that implement a restrictive form of chaining and for ones that implement
fully flexible chaining. (Chaining hardware is described in Chapter 2, Fundamentals of
Vector Architectures, on page 11.) The Cray-1 is an example of a vector architecture with
restricted chaining and the Cray X-MP and Y-MP are examples of ones with fully flexible
chaining. Arya modeled the problem of finding an optimal execution order as an integer
programming problem, a technique which is expected to take considerably longer to execute
than heuristic approaches such as list scheduling [7]. Bernstein, Boral, and Pinter extended
the work of Aho and Johnson to apply to vector architectures and presented an optimal
algorithm that always generates an order that executes in the shortest time for a given
number of vector registers in hardware [10, 3]. However, this algorithm is applicable only
for dependence graphs that are trees, which have no common subexpressions. Finally, Tang
Figure 5.17: Register Usage of the List Scheduler

A box marks a loop whose minimal register requirement is greater than that
of the base configuration of a graph. Section 5.2.1 describes the
performance metrics and the basic layout of these graphs, and Figures 5.18 and 5.19 list the data
plotted in them.
† minimal register requirement


Figure 5.18: Performance Data for the 18 Shortest Loops

For the 18 shortest loops in the CRI workload, this table gives the average execution time per
iteration using two scheduling algorithms with a different number of vector registers in hardware.
Descriptions of the cft77 and list schedulers are given in Figures 5.8 and 5.9 (on pages 93 and 94),
respectively. The column marked † lists the minimal register requirement of the execution order
produced for a loop by the list-scheduling algorithm. The rightmost column gives the minimal
execution time for a loop based on functional-unit usage. Data for the other half of the CRI workload
is given in Figure 5.19.


† minimal register requirement


Figure 5.19: Performance Data for the 18 Longest Loops and the Entire Workload

For the 18 longest loops in the CRI workload, this table gives the average execution time per
iteration using two scheduling algorithms with a different number of vector registers in hardware.
Descriptions of the cft77 and list schedulers are given in Figures 5.8 and 5.9 (on pages 93 and 94),
respectively. The column marked † lists the minimal register requirement of the execution order
produced for a loop by the list-scheduling algorithm. The rightmost column gives the minimal
execution time for a loop based on functional-unit usage. Data for the other half of the CRI workload
is given in Figure 5.18.
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DavMson  investigated  the  itnpnct  “7"“  "  "" 

architectural  features  of  the  Cray-1  and  Cr^  Tang  as  well  as  Eisenbeis,  Jalby, 

which  does  not  implement  chaining  [  ,  ]  expense  of  using  more  registers.  Both 

of  the  Cray.2  but  do  not  invesHgnte  how  n.»y  more  .1,, 

The  above  studies 

algorithm  aspect  and  have  assume  a  nprformance  of  a  scheduling  algo- 

Mangione-Smith,  Abraham,  and  The  purpose  of  their  study,  which  is  a 

rithm  using  different  hardware  config  I  J  determine  the  dimensions 

continuation  of  the  work  by  Tang  et  a  .  an  isen  workload  to  execute  in  the  shortest 

of  a  vector  register  hie  that  «  the  loops 

time  when  using  a  polycychc  schedul  •  nn  the  snecial  structure  of  a  dependence 

register  requirement  of  a  polycyclic  schedule  ^contrast,  75%  of  the 

graph  that  is  a  tree,  the  loops  used  in  their  ^  '  Expressions.  To  determine 

dependence  graphs  in  the  CRI  algorithm  imphcitly 

the  minimum  number  of  vector  registers  needed  j" 

enumerates  all  polycyclic  schedules  or  t  a  workload  is  possible  when  using 

"  -r t  S  r ~  one  Ih  .  registers  a.d 

16  elements  per  register. 


5.6  Summary 

In  this  chapter,  I  answered  the  question  posed  in  the  opening  paragraph  of  this 

dissertation.  registers  are  enough  to  effectively  use 

the  functional  units  of  the  Cray  Y-MP  vector  processor. 

To  do  .bia,  I  examined  bow 

''■f/  -Ld'r.\tirt”uw\raff^^^^^  sig„mcan.ly.  An  undert^ing  for 

Srint  ‘vtriy'^ba.  increaaing  .be  number  trV:r:eVS‘’'t’:;" 

knowledge  and  experimental  data.  1  g  example  when  the  cft77  scheduler 

P0.besea  tba.  were  "V  ^  ,  Eiffel,  acbeduier 

performed  SO  poorly  using  6  g  .  that  8  recisters  were  enough  because  I  un- 

L  needed  rather  .ban  ^  ^  ^.^r  le^.ion  Ime.  Moreover. 

“Xt?- XS:rire  .ha.  more  registers  couid  improve  performance  by 
allowing more parallelism to occur. The computational impracticality of enumerating all
possible execution orders for a loop was another hypothesis that was verified by
experimentation, which in turn led to the use of scheduling algorithms that are based on heuristics.
Correctly interpreting some results is another example of the interaction between
knowledge and data. For example, the fact that the scheduling algorithms I used are based on
heuristics explains why the list scheduler does worse than the cft77 scheduler on some loops.

Theoretical knowledge alone, however, does not provide enough information to
make decisions; quantitative results are also needed. (Such results must be obtained by
experimentation because the current state of the art in performance analysis does not
include analytical formulas that can produce such data.) For example, knowing that finding
a definitive answer to the above question is an NP-hard problem does not necessarily rule
out choosing this method. However, after producing experimental data that showed that
current computers or even ones in the near future would require years to find the definitive
answer for the majority of the loops in the CRI workload, I chose an approach that is
computationally practical but cannot give a definitive answer.

Although theoretical knowledge provides a qualitative ranking, experimental data
is necessary to quantify how much better and how often performance is improved. For
example, the descriptions of the two scheduling algorithms indicate that the list scheduler
should outperform the cft77 scheduler because of their differing goals, which are summarized
in Figure 5.10 (on page 95). Nonetheless, both algorithms rely on heuristics, which can be
either exploited or thwarted by different dependence graphs; Figure 5.3,
for example, demonstrates how the list scheduler generates a better execution order than
does the cft77 scheduler, and Figure 5.12 (on page 98) illustrates the reverse. What these
qualitative descriptions and examples do not indicate is how much or how frequently one
scheduling algorithm outperforms the other.

Another example demonstrating how quantitative results sharpen a qualitative
ranking concerns the usage of registers. Although more registers should allow more
parallelism to occur which, in turn, allows the functional units to be used more often, a
quantitative study is needed to indicate not only how many registers improve the performance by
how much but also how frequently such performance improvement occurs. To justify
changing the scheduling algorithm in the cft77 compiler or increasing the number of registers, a
quantitative study using a representative set of loops is needed to show not only that the
improvement to performance is large enough, but also that the improvement occurs often
enough.

Figure 5.20 summarizes the experimental data I produced to justify the proposed
changes. This figure shows the performance of various scheduler&register combinations relative
to that of the cft77 scheduler using 8 vector registers. A scheduler&register combination
improves performance significantly if it improves performance for both the entire workload
and a majority of the individual loops. The rationale for prescribing both these criteria is
given in Section 5.2.1. Based on these criteria for acceptable improvement in performance,
the cft77 scheduler does not improve performance significantly for the CRI workload even
with an abundance of vector registers, whereas the list scheduler using only 16 registers
does. Although the list scheduler using 32 registers further improves the performance of the
entire workload, the distribution of improvement in performance for individual loops does

Figure 5.20: Summary of Performances for Various Scheduler & Register Combinations

This table summarizes the performance of various scheduler&register combinations relative to
that of the scheduling algorithm implemented in the 1990 version of the cft77 compiler using 8 vector
registers. The raw data for this performance summary are given in Figures 5.18 and 5.19,
and Section 5.2.1 (pages 83 to 87) explains the metrics used to compare the performance of a
scheduler&register combination to that of the cft77 scheduler using 8 registers. Descriptions of
the cft77 and list scheduling algorithms are given in Figures 5.8 and 5.9 (on pages 93 and 94),
respectively. Because these two scheduling algorithms are based on heuristics, it is possible that
either one outperforms the other for individual loops. For example, when using 64 registers the
cft77 scheduler outperforms the list scheduler by more than 5% for 3 loops, whereas the list scheduler
outperforms the cft77 scheduler for other loops. Section 5.1.2 (pages 79 to 83) describes how the
static upper bound is calculated and also explains why this “upper bound” has worse performance
than the cft77 scheduler for some of the loops.


not change significantly for this scheduler&register combination relative to the distribution
for the same scheduler using 16 registers. Hence, because the improvement in performance
does not justify doubling the number of registers from 16 to 32, the answer to the question
posed in the opening paragraph of this dissertation is 16 vector registers, as long as an
appropriate scheduling algorithm is used.

In this chapter, I emphasized performance in order to show that more than 8 vector
registers can improve performance by more than 10%. Because increasing the number of
registers is a costly endeavor, I will examine the cost of doing so in the next chapter.
Furthermore, in order to balance the improved performance with increased cost, I will also
investigate a special organization of 16 vector registers that requires only a 10% increase in
hardware.




Chapter  6 

Bus  Usage  and 
Register  Assignment 


In the previous chapter, I showed that using more than 8 vector registers significantly
improves performance. However, because this analysis of performance excludes the
cost of implementing more vector registers, I explain in this chapter why adding vector registers
in a straightforward fashion is ill-advised and investigate a different organization for
a vector register file that is more cost-effective.

To introduce what this different organization is, I first analyze the hardware cost of
implementing  more  registers  and  suggest  an  organization  that  is  less  costly.  Implementing 
a  new  register  file,  however,  is  viable  only  if  it  can  be  used  and  used  without  degrading 
performance.  Accordingly,  the  bulk  of  this  chapter  addresses  these  concerns  of  utility  and 
performance.  After  introducing  the  new  organization,  I  describe  an  assignment  algorithm  I 
developed  that  uses  vector  register  files  with  such  an  organization.  I  then  present  data  to 
evaluate  how  well  my  algorithm  uses  such  configurations  and  finish  with  a  discussion  about 
choosing  a  cost-effective  one  for  the  Cray  Y-MP  vector  processor. 

6.1 Cost/Performance Analysis

In  the  previous  chapter,  I  presented  data  showing  that  doubling  the  number  of 
vector  registers  from  8  to  16  improved  performance  by  9%.  Although  implementing  more 
registers obviously increases the cost of hardware, a general rule-of-thumb is that a new
hardware configuration that increases the cost of implementation is acceptable if it improves
performance by at least as much as it increases the cost of implementation. To
determine whether doubling the number of vector registers is, in fact, an acceptable tradeoff
between increased cost and improved performance, in this section I first analyze the
hardware cost of implementing more registers, and then I combine this cost analysis with the
performance analysis of the previous chapter to examine the tradeoff between increased cost and
improved performance for various configurations of a vector register file. Finally, based on
this cost/performance analysis, I outline the goals of this chapter's study.

A  vector  register  file  actually  consists  of  multiple  banks  of  registers,  each  of  which 
has  two  buses:  a  read  bus  to  an  interconnection  that  sends  operands  to  a  set  of  functional 
units, and a write bus from an interconnection that receives results from those functional
units (see Figure 2.4 on page 13 in Chapter 2, Fundamentals of Vector Architectures).
Increasing the size of the register file in a straightforward fashion increases not only the
number of register banks but also the size of the interconnections. In Chapter 3, A Case
for Vector Architectures, I argued that the size of these interconnections is negligible when
compared to the registers themselves. So the cost of implementing more vector registers in
a straightforward fashion is determined mainly by the register cells. But this cost analysis
applies only to a single-chip VLSI implementation. A different cost analysis is needed for
a multi-chip implementation, which is how the Cray Y-MP vector processor is built, where
the relative sizes of register banks and interconnections are quite different from their relative
sizes in a single-chip implementation.

The  function  of  an  interconnection  is  to  transfer  large  amounts  of  data  between 
vector  registers  and  functional  units.  Wawrzynek  and  von  Eicken  have  shown  that  in  a 
single-chip, CMOS implementation such functionality is provided relatively easily by
metal lines and multiplexors that take up a negligible amount of extra space [117, 118]. On
the other hand, in a multi-chip implementation with 64-bit data buses, this functionality
requires numerous pins to provide a physical path between vector registers and functional
units. For example, each bus of a vector register requires 64 pins and each input or output
bus from a functional unit needs 64 pins. Because large numbers of pins are required and
because one chip can accommodate only a relatively small number of pins, many chips
are needed to implement these two interconnections. In fact, the number of chips needed
to implement both interconnections is 1.5 times the number of chips that implement the
8 vector registers in the Cray Y-MP.

Given  that  8  vector  registers  account  for  10%  of  the  chips  in  an  implementation  of 
one  Y-MP  processor  and  the  accompanying  interconnections  account  for  15%  of  the  chips, 
doubling the size of the vector register file from 8 to 16 registers in a straightforward fashion
results in a 25% increase in the chip count. Hence, in a multi-chip implementation, such
as that of the Cray Y-MP vector processor, more than half the cost of implementing more
registers is due to increasing the size of the interconnections. Because doubling the number
of  vector  registers  produces  a  9%  improvement  in  performance,  using  16  does  not  appear 
to  be  an  acceptable  tradeoff  between  increased  cost  and  improved  performance. 
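
The arithmetic behind this conclusion can be checked with a short sketch. The chip shares (10% for the registers, 15% for the interconnections) and the 9% performance gain are the figures quoted in the text; the function and variable names are illustrative only, and the acceptance test simply encodes the rule-of-thumb stated at the start of this section.

```python
# Worked check of the chip-count arithmetic for a straightforward
# doubling of the vector register file (8 -> 16 registers).
# Names are illustrative; the percentages are the ones quoted in the text.

REGISTER_SHARE = 0.10       # 8 vector registers: 10% of the processor's chips
INTERCONNECT_SHARE = 0.15   # the two interconnections: 15% of the chips


def straightforward_increase(scale):
    """Chip-count increase when registers and their buses both grow by `scale`.

    Returns the increase due to registers and the increase due to
    interconnections, each as a fraction of the original chip count.
    """
    register_part = REGISTER_SHARE * (scale - 1)
    interconnect_part = INTERCONNECT_SHARE * (scale - 1)
    return register_part, interconnect_part


def acceptable(performance_gain, cost_increase):
    """Rule-of-thumb: performance must improve at least as much as cost rises."""
    return performance_gain >= cost_increase


reg_part, ic_part = straightforward_increase(2)
total = reg_part + ic_part
print(f"total increase in chip count: {total:.0%}")
print(f"share of the increase due to interconnections: {ic_part / total:.0%}")
print("acceptable against a 9% performance gain:", acceptable(0.09, total))
```

The interconnection term dominates the total, which is what motivates looking for an organization that adds registers without adding buses.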

(Although this analysis covers only the cost of a processor and ignores the cost
of the memory and I/O systems of a computer, the main purpose for this cost analysis
is to motivate the investigation of a special organization of registers that reduces the cost
of adding more registers. As I demonstrate later in this chapter, in addition to being
less expensive to implement, this special organization has comparable performance to a
traditional one with the same number of registers. Consequently, although the cost of
adding more registers relative to an entire computer is less significant, the cost/performance
ratio can still be improved when using this special organization.)

One possible solution for improving the tradeoff between increased cost and performance
gain is to change the number of elements per vector register, which is also known
as the vector length. Halving the vector length from 64 to 32 while doubling the number
of vector registers from 8 to 16 results in a 15% rather than 25% increase in chip count.
Halving the vector length again to 16 results in only a 10% increase in chip count. A reasonable
vector length, however, is needed to hide both the latency of vectorizable operations
and the execution of scalar operations. Moreover, when functional units are [...]
some number of clock periods greater than the vector length, as is the case in the Cray
Y-MP processor, the performance penalty of this extra delay can be [...]
appropriately long vector length. As a result, although a vector length of 16 appears to
be an acceptable tradeoff between increased cost and improved performance, sustainable
performance may be adversely affected when the vector length is shortened so much.
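
The effect of vector length on chip count can be made concrete with a small model. It assumes, as the percentages quoted above imply, that register chips scale with the total number of elements (registers times vector length) and that interconnection chips scale with the number of registers, each register keeping its own buses. The function name and parameters are hypothetical.

```python
def traditional_increase(num_registers, vector_length):
    """Relative change in chip count versus 8 registers of 64 elements.

    Assumes a traditional organization in which every register has its own
    read and write bus, so interconnection chips grow with the register count
    while register chips grow with the total number of elements stored.
    """
    register_chips = 0.10 * (num_registers * vector_length) / (8 * 64)
    interconnect_chips = 0.15 * num_registers / 8
    baseline = 0.10 + 0.15
    return register_chips + interconnect_chips - baseline


for vector_length in (64, 32, 16):
    change = traditional_increase(16, vector_length)
    print(f"16 registers, vector length {vector_length}: {change:+.1%}")
```

Under these assumptions the model reproduces the 25%, 15%, and 10% increases quoted in the text. It also shows that shortening vectors only trims the register term; the interconnection term is unchanged.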

Because the size of an interconnection is determined by the number of buses attached
to it, another possible solution for improving the cost/performance tradeoff is to
have more than one vector register share a bus. As Figure 6.1 shows, although adding
vector registers still increases the hardware cost, the rise in cost is much less for this new
configuration than for a traditional one with the same number of registers. Another way to
describe this new configuration is that it partitions the vector registers into banks, where
each bank has its own read and own write buses. Based on this description, I call such a
configuration a partitioned vector register file.

Both the Ardent Titan, a commercial vector computer, and the Fujitsu VPU, a
VLSI implementation of a vector processor, have implemented a partitioned vector register
file. In addition to being partitioned, these vector register files are also reconfigurable in
that the number of vector registers and their vector length can be varied under software
control. The vector register file in the Ardent Titan can be viewed as a partitioned one with
four banks, each of which contains 2048 elements [31]. The number of vector registers in a
bank can vary from 1 to 2048, depending upon the vector length, which can vary from 2048
down to 1. The vector register file in the Fujitsu VPU also has four banks, each of which
contains 256 elements [64]. The number of vector registers per bank can vary from 2 to 16,
depending upon the vector length, which can vary from 128 down to 16.
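
The relationship between vector length and register count in a reconfigurable bank is a simple division, sketched below. The bank sizes are the ones quoted in the text. The function name is hypothetical, and the sketch assumes that only vector lengths dividing the bank size evenly are of interest.

```python
def registers_per_bank(elements_per_bank, vector_length):
    """Number of vector registers one bank holds at a given vector length."""
    if elements_per_bank % vector_length != 0:
        raise ValueError("vector length must divide the bank size evenly")
    return elements_per_bank // vector_length


# Ardent Titan: four banks of 2048 elements each.
titan = {vl: registers_per_bank(2048, vl) for vl in (2048, 64, 1)}

# Fujitsu VPU: four banks of 256 elements each.
vpu = {vl: registers_per_bank(256, vl) for vl in (128, 64, 32, 16)}

print("Titan registers per bank by vector length:", titan)
print("VPU registers per bank by vector length:", vpu)
```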

Figure 6.2 shows a rudimentary analysis of the tradeoff between increased cost and
improved performance among various configurations of partitioned register files, where the
performance of a partitioned register file approximates the performance of a traditional one
with the same number of registers. Based on this approximation for performance, using
16 registers appears to be an acceptable cost/performance tradeoff if 4 or [...]
to share a bus. Likewise, a partitioned register file with 32 registers and 4 or 8 buses, which
improves performance by nearly 18%, is nearly an acceptable tradeoff between increased
cost and improved performance, although it is not as cost-effective as one with 16 registers.
In contrast, using 64 registers will never achieve an acceptable cost/performance tradeoff
because, even though performance is improved by about 18%, the relative increase in cost
is, at best, more than three times higher no matter how many registers share a bus. One
other interesting configuration to consider from a cost standpoint is a partitioned register
file with 4 buses and 8 registers. This yields an acceptable cost/performance tradeoff as
long as performance does not degrade more than 7.5%; the best result is achieved when no
performance degradation occurs at all.
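
The comparisons in this paragraph can be tabulated with a short sketch. It assumes that the chip-count expression in the caption of Figure 6.1 reads (B/8 × 1.5 + R/8 − 2.5) × 0.10 for B buses and R registers, a reading that reproduces the 25% and 7.5% figures quoted in the text. The performance gains are the approximate values quoted above, and all names are illustrative.

```python
def chip_count_difference(buses, registers):
    """Relative difference in chip count versus 8 buses and 8 registers.

    Assumed reading of the expression in the caption of Figure 6.1.
    """
    return (buses / 8 * 1.5 + registers / 8 - 2.5) * 0.10


# Approximate performance gains over 8 registers, as quoted in the text.
PERFORMANCE_GAIN = {16: 0.09, 32: 0.18, 64: 0.18}

print(f"{'registers':>9} {'buses':>5} {'cost':>8} {'gain':>6}")
for registers, gain in PERFORMANCE_GAIN.items():
    for buses in (2, 4, 8):
        cost = chip_count_difference(buses, registers)
        print(f"{registers:>9} {buses:>5} {cost:>+8.2%} {gain:>6.0%}")

# 4 buses and 8 registers: a cheaper register file than the baseline.
print(f"4 buses, 8 registers: {chip_count_difference(4, 8):+.1%}")
```

The table makes the shape of the tradeoff visible: for a fixed number of registers the cost falls as registers share more buses, but the register term alone puts 64 registers well above any gain quoted here.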

Based  on  this  initial  cost/performance  analysis,  using  a  partitioned  register  file 

Because I always deal with vector objects in this chapter, I use the term element to refer to a register
in a vector register, the term vector register interchangeably with register, and the term partitioned vector
register file interchangeably with partitioned register file.

Figure 6.1: Relative Differences in Chip Count Among Vector Register Files

The relative difference in chip count for a partitioned register file with B buses and R registers is:

\[ \left( \frac{B}{8} \times 1.5 + \frac{R}{8} - 2.5 \right) \times 0.10 \]

Because the difference in cost is relative to the cost for 8 buses and 8 registers, using 4 buses and
16 or fewer registers actually results in a decrease in the overall chip count.

Based on data in Figure 6.2, which shows the tradeoffs between increased cost and improved
performance for partitioned register files with 16, 32, and 64 registers, the two relative differences
highlighted in bold above correspond to configurations that could provide an acceptable cost/performance
tradeoff.


[Figure 6.2 plot legend: relative increase in chip count; relative improvement in performance]

Figure 6.2: Cost/Performance Comparisons for Various Partitioned Register Files

This figure combines cost data from Figure 6.1 and performance data
from the previous chapter in order to compare the relative increase in chip count with the relative
[...] register file, where each
register has its own bus, I have also shown the cost of implementing various partitioned register
[...] bound on the improvement possible for the indicated number of registers.

A general rule-of-thumb for justifying a more costly implementation is if a comparable increase
in performance can also be achieved. The above figure shows that implementing 64 registers is never
cost-effective, whereas using 8 or 4 buses with 16 registers appears to be an acceptable tradeoff
between increased cost and improved performance.

with 16 registers appears to be an acceptable tradeoff between increased cost and improved
performance. But there is little point in implementing a partitioned register file if it cannot
be used. Moreover, in the cost/performance analysis described above, I assumed that
the performance using a partitioned register file is the same as the performance using a
traditional one with the same number of registers. This assumption, of course, is not
necessarily true. Although a partitioned register file is less costly to implement, such a
configuration, because multiple registers share a bus, imposes more restrictions on accessing
vector registers. As a result, the performance of a partitioned register file could be less than
that of a traditional one with the same number of registers.

These  two  shortcomings  provide  several  goals  for  this  chapter’s  study.  One  goal  is 
to  design  an  algorithm  that  can  use  a  partitioned  register  file.  Another  goal  is  to  evaluate 
how  effective  the  algorithm  is  and  to  determine  whether  the  performance  of  a  partitioned 
register  file  is  comparable  to  that  of  a  traditional  one  with  the  same  number  of  registers. 
A  third  goal  of  this  study,  which  is  appropriate  only  if  the  first  two  are  successful,  is  to 
provide  data  that  determines  whether  a  partitioned  register  file  with  16  registers,  in  fact, 
yields  an  acceptable  tradeoff  between  increased  cost  and  improved  performance. 

Finally,  to  provide  an  intuitive  understanding  of  my  results,  I  make  an  important 
observation  that  gives  a  clue  to  both  how  and  how  well  a  partitioned  register  file  can  be 
used. To use a partitioned register file effectively, I take advantage of the fact that

only  a  subset  of  simultaneously  live  values  are 

actually  used  at  any  given  time 

For  example,  in  Figure  5.3  (on  page  77  in  the  previous  chapter)  in  the  fourth  chime  of  the 
execution  order  on  the  left,  only  three  out  of  five  live  values  are  actually  being  used.  Values 
that  are  live  at  the  same  time  must  be  stored  in  different  registers  to  avoid  generating  spill 
code.  However,  if  they  are  never  used  at  the  same  time,  they  can  be  stored  in  registers  that 
share  the  same  bus.  For  example,  using  the  execution  order  on  the  left  in  Figure  5.3,  the 
values  produced  by  +«  and  LOADn  can  share  the  same  bus  but  not  the  values  produced  by 
+2 and LOADji. In other words, registers store live values, buses transfer values that are
being operated on, and there are more live values than active ones at any given time. In
the rest of this chapter, I present graphical representations of register and bus usages to
model the usage of a partitioned register file, and present data showing that the number
of simultaneously active values is sufficiently small that a partitioned register file is an
acceptable cost/performance configuration for implementing more registers.
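
The distinction between live and active values can be illustrated by counting both per chime. The sketch below uses a small hypothetical execution order, not the one in Figure 5.3, and works at chime granularity, ignoring operational latencies. A value is treated as live from the chime in which it is written through the chime of its last read, and as active only in chimes where it is written or read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    name: str
    written: int    # chime in which the value is produced
    reads: tuple    # chimes in which the value is read as an operand

    def live_in(self, chime):
        """Live from the chime it is written through its last read."""
        return self.written <= chime <= max(self.reads, default=self.written)

    def active_in(self, chime):
        """Active only while it is being written or read."""
        return chime == self.written or chime in self.reads


# Hypothetical execution order (not taken from Figure 5.3).
VALUES = [
    Value("load_a", 0, (1, 3)),
    Value("load_b", 0, (1,)),
    Value("add_1", 1, (2,)),
    Value("load_c", 1, (2,)),
    Value("mul_1", 2, (3,)),
    Value("add_2", 3, (4,)),
]


def usage_by_chime(values):
    """Return (chime, live count, active count) for every chime."""
    last = max(max(v.reads, default=v.written) for v in values)
    rows = []
    for chime in range(last + 1):
        live = sum(v.live_in(chime) for v in values)
        active = sum(v.active_in(chime) for v in values)
        rows.append((chime, live, active))
    return rows


for chime, live, active in usage_by_chime(VALUES):
    print(f"chime {chime}: {live} live, {active} active")
```

In this example `load_a` stays live across a chime in which it is neither read nor written, so it occupies a register there without needing a bus.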

6.2  Assignment  Algorithm  for 
a  Partitioned  Register  File 

In  this  section,  I  describe  the  algorithm  I  developed  to  use  a  partitioned  register 
file where more than one register shares a bus. Because most people tend naturally to focus
on the execution of operations, they often consider the use of registers (and buses) from
the perspective of the operations that use these resources. In other words, an operation
reads one or two operands from registers and produces a result that is stored in a different
register, which in turn is read as an operand for one or more additional operations.

This point of view about operations, however, does not provide the proper frame
of reference for determining how to use registers and buses. To do so, focus must instead be
placed upon the operands and results, which are seen as values, each of which is produced by
some operation. A value is stored in a register from when it is first produced by an operation
to when it is last read by another operation. Moreover, a value is transferred on a bus when
it is first produced as a result and for each time that it is read as an operand. This point of
view shows that values are assigned to registers and buses to use these resources properly.
In  other  words,  using  a  partitioned  register  file  consists  of  two  assignment  problems: 

1.  assigning  values  to  registers  without  causing  any  register  conflicts,  and 

2.  assigning  values  to  buses  without  causing  any  bus  conflicts. 

Access conflicts, such as WAR or WAW register dependences, are to be avoided because they
add extra chimes to execution time. These two problems are related in that assignments for
either one are made under the additional constraint that a fixed number of registers share
a bus.

One of the goals of this study is to determine whether the performance of a partitioned
register file is comparable to that of a traditional one with the same number of
registers. In the context of these assignment problems, this goal is equivalent to determining
whether the number of buses must increase with the number of registers in order
to avoid any access conflicts. The algorithm I developed thus attempts to minimize the
number of buses and registers assigned. Minimizing register usage, however, can increase
bus usage, and vice versa. Because increasing the number of buses requires more hardware,
higher priority is given to minimizing the number of buses assigned. Hence, my algorithm
really consists of two algorithms executed in sequence: the first one assigns values to buses
and then, for each assigned bus, the second algorithm assigns values to registers that share
a bus.
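
The two-phase structure described above can be sketched as follows. This is only an illustration of the ordering (buses first, then registers within each bus). The greedy colorings are stand-in heuristics, not the algorithm developed in this dissertation, and the execution order, names, and conflict tests (chime granularity, no latencies) are all hypothetical.

```python
from itertools import count

# A value is (name, chime written, chimes read); the data are hypothetical.
VALUES = [
    ("load_a", 0, (1, 3)),
    ("load_b", 0, (1,)),
    ("add_1", 1, (2,)),
    ("load_c", 1, (2,)),
    ("mul_1", 2, (3,)),
    ("add_2", 3, (4,)),
]


def bus_conflict(a, b):
    """Two values conflict on a bus if both are written, or both read, in one chime."""
    return a[1] == b[1] or bool(set(a[2]) & set(b[2]))


def register_conflict(a, b):
    """Two values conflict on a register if their lifetimes overlap."""
    end_a = max(a[2], default=a[1])
    end_b = max(b[2], default=b[1])
    return a[1] <= end_b and b[1] <= end_a


def greedy_color(values, conflict):
    """Give each value the lowest color unused by conflicting, already-colored values."""
    colors = {}
    for v in values:
        taken = {colors[u[0]] for u in values if u[0] in colors and conflict(u, v)}
        colors[v[0]] = next(c for c in count() if c not in taken)
    return colors


def assign(values, registers_per_bus):
    """Phase 1 assigns values to buses; phase 2 assigns registers within each bus."""
    bus_of = greedy_color(values, bus_conflict)
    register_of = {}
    for bus in sorted(set(bus_of.values())):
        on_bus = sorted((v for v in values if bus_of[v[0]] == bus), key=lambda v: v[1])
        local = greedy_color(on_bus, register_conflict)
        if max(local.values()) >= registers_per_bus:
            raise ValueError(f"bus {bus} needs more than {registers_per_bus} registers")
        for name, local_register in local.items():
            register_of[name] = bus * registers_per_bus + local_register
    return bus_of, register_of


bus_of, register_of = assign(VALUES, registers_per_bus=2)
for name, _, _ in VALUES:
    print(f"{name}: bus {bus_of[name]}, register {register_of[name]}")
```

Because spilling is not modeled, the sketch simply raises an error when a bus would need more registers than it has.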

My algorithm is also influenced by the cft77 compiler (in which it would execute)
and by practical considerations in the hardware. The input to this algorithm is an execution
order for a dependence graph, rather than the dependence graph itself, because register
assignment occurs after scheduling in the cft77 compiler. Moreover, because a dependence
graph corresponds to a basic block, I consider only local register assignment. Finally, in
a practical hardware configuration, the number of registers per bus is fixed and equal to
the number of registers divided by the number of buses. This quotient is an independent
parameter that allows my algorithm to be used for a partitioned register file with an arbitrary
but fixed amount of partitioning. Additionally, to keep the implementation simple,
the number of registers per bus is a power of two.
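
One reason a power-of-two quotient keeps an implementation simple is that the bus serving a register can be found by discarding low-order bits of the register number. The sketch below assumes that consecutively numbered registers share a bus; that numbering and the function name are hypothetical, since the text does not specify how registers are numbered within a partition.

```python
def make_bus_lookup(num_registers, num_buses):
    """Return a function mapping a register number to the bus it shares.

    Assumes consecutively numbered registers are grouped onto the same bus.
    """
    registers_per_bus, remainder = divmod(num_registers, num_buses)
    is_power_of_two = registers_per_bus > 0 and (registers_per_bus & (registers_per_bus - 1)) == 0
    if remainder != 0 or not is_power_of_two:
        raise ValueError("registers per bus must be a power of two")
    shift = registers_per_bus.bit_length() - 1   # log2 of a power of two
    return lambda register: register >> shift


bus_of = make_bus_lookup(num_registers=16, num_buses=4)
print([bus_of(r) for r in range(16)])
```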

To  keep  this  algorithm  simple,  I  do  not  consider  the  possibility  of  register  spilling. 
Moreover, although the hardware allows assignments of the form Vi ← Vi op Vj, my algorithm
never uses such assignments; instead, it always assigns a different register for each
value read or produced by an operation. I address both these omissions in a later section
when  I  evaluate  the  effectiveness  of  this  algorithm. 

I outline the various types of algorithms for register assignment when I discuss related studies in
Section 6.5.

To model the use of a partitioned register file, my assignment algorithm is based
on the optimal graph coloring problem, which poses the question:

What is the minimum number of colors needed to color
a graph such that no adjacent vertices are assigned the
same color? [46]

The formulation of register assignment as a graph [...]
resources to be assigned and the graph, called an [...]
I examine the structure of these interference graphs to [...]
NP-hard. Although coloring an arbitrary graph is known to be NP-hard, [...]
graphs can be used to [...]
partitioned register file, and how the special structure of the one graph
guarantees that the minimum number of registers is used per bus.

6.2.1  Two  Interference  Graphs 

In my assignment algorithm, I use two interference graphs: a live interference graph
for assigning registers, and an active interference graph for assigning buses. Figure 6.3
shows a small example of these two graphs and illustrates how [...]
at the same time, an edge is placed in a live interference graph between two values that
are simultaneously live in order to model how register conflicts can occur. In the execution
[...] live at the same time because their associated arrows
[...] the same chime. Hence, there is an edge among all the values in the corresponding
live interference graph. In a similar fashion, whether two vertices are connected in [...]
usage. Because two values assigned to the same bus will conflict if the
values are either read or written at the same time, an edge is placed in an active interference
graph between two values that are simultaneously read or simultaneously written [...]
to model how bus conflicts can occur. In the execution order of Figure 6.3, values [...]

Figure 6.3: Live and Active Interference Graphs

This figure illustrates the part of the live and active interference graphs associated with the
second chime of the execution order on the right in Figure 5.3 (on page 77 in the previous chapter).
This portion of the execution order is reproduced above. A vertex in an interference graph represents
a value, identified in the above figure by the operation that produces it. An edge represents a
potential resource conflict between its incident vertices.

This figure also illustrates how coloring these graphs provides a conflict-free assignment of values
to registers and buses, where more than one register can share a bus. The individual interference
graphs in the middle of the above figure demonstrate how coloring the live interference graph assigns
values to registers (denoted above as the four “colors” R0, R1, R2, and R3) and how coloring the
active interference graph assigns values to buses (denoted above as the three “colors” B0, B1, and
B2). Once colored, these two graphs together indicate which values can be stored in registers that
share the same bus without causing any resource conflicts that increase execution time.

written into a register if the beginning of its arc [...]
conflict-free assignment of values to [...]
[...] In other words, a minimum of four registers (denoted R0, R1, R2, and R3) are
needed to store these values without any access conflicts among registers, [...]
active interference graph [...]
access conflicts among registers or buses, a condition that [...]
live are not necessarily simultaneously [...]
interference graph. This fact allows my algorithm, when it assigns [...]
that uses detailed timing information about the execution of vector instructions
[...] graphs that more closely reflect what actually happens in the
hardware, particularly for a processor that supports fine-grain parallelism. More accurate
[...] in turn allow a compiler [...]
registers are needed to avoid access conflicts. Consequently, an [...]
must include the effects of operational latencies
when these interference graphs are constructed.
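
The construction of the two interference graphs from an execution order can be sketched as follows. The execution order is hypothetical and is expressed at chime granularity, so the operational latencies that this section says must be included are deliberately ignored here. Each operation is written as a pair (result, operands), with `None` as the result of a store; all names are illustrative.

```python
from itertools import combinations

# Hypothetical execution order: one list of (result, operands) per chime.
EXECUTION_ORDER = [
    [("a", ()), ("b", ())],            # two loads
    [("c", ("a", "b")), ("d", ())],    # an add and a load
    [("e", ("c", "d"))],
    [("f", ("a", "e"))],
    [(None, ("f",))],                  # a store produces no value
]


def value_usage(order):
    """Record the chime each value is written in and the chimes it is read in."""
    written, reads = {}, {}
    for chime, operations in enumerate(order):
        for result, operands in operations:
            for operand in operands:
                reads[operand].add(chime)
            if result is not None:
                written[result] = chime
                reads[result] = set()
    return written, reads


def interference_graphs(order):
    """Return the edge sets of the live and active interference graphs."""
    written, reads = value_usage(order)
    last_use = {v: max(reads[v] | {written[v]}) for v in written}
    live, active = set(), set()
    for u, v in combinations(sorted(written), 2):
        # Live edge: the two lifetimes overlap, so they need different registers.
        if written[u] <= last_use[v] and written[v] <= last_use[u]:
            live.add((u, v))
        # Active edge: both written, or both read, in the same chime.
        if written[u] == written[v] or reads[u] & reads[v]:
            active.add((u, v))
    return live, active


live, active = interference_graphs(EXECUTION_ORDER)
print(len(live), "live edges;", len(active), "active edges")
print("active edges are a subset of live edges:", active <= live)
```

At this granularity every active edge is also a live edge, but not the reverse, which is the property that lets values that are live together still share a bus.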

To summarize, live and active interference graphs, which model register
and bus conflicts respectively, are constructed from an execution order of a dependence
graph; there are edges between values that are used in the same way at the same time.

6.2.2 Structure of Live and Active Interference Graphs

Although coloring an arbitrary graph is known to be NP-hard, coloring interval
graphs can be done in polynomial time. Hence, [...]
assignment problem is NP-hard, [...] these graphs. The traditional diagram
[...] capture this timing information to reveal any
special [...] graph, so named because it can be represented
by a set of intervals on the real line:

Given a collection of intervals I_1, ..., I_n on the real line, an interval graph is a
graph where each vertex represents an interval and there is an edge between
vertices i and j if and only if the intervals I_i and I_j overlap [49].
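
The reason interval structure matters for coloring can be seen in a short sketch. Sweeping the intervals in order of their left endpoints and reusing the color of any interval that has already ended is a standard technique that colors an interval graph with the minimum number of colors, namely the largest number of intervals that overlap at one point. The lifetimes and names below are hypothetical, and the sketch is not the assignment algorithm of this chapter.

```python
import heapq


def color_intervals(intervals):
    """Color closed intervals so that overlapping intervals get different colors.

    `intervals` maps a name to (start, end). Returns a dict of name -> color.
    """
    free = []        # min-heap of colors released by intervals that have ended
    busy = []        # min-heap of (end, color) for intervals still open
    colors = {}
    next_color = 0
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Closed intervals: one that ends exactly at `start` still overlaps.
        while busy and busy[0][0] < start:
            _, released = heapq.heappop(busy)
            heapq.heappush(free, released)
        if free:
            color = heapq.heappop(free)
        else:
            color = next_color
            next_color += 1
        colors[name] = color
        heapq.heappush(busy, (end, color))
    return colors


def max_overlap(intervals):
    """Largest number of intervals containing a common point."""
    starts = {start for start, _ in intervals.values()}
    return max(sum(s <= p <= e for s, e in intervals.values()) for p in starts)


# Hypothetical lifetimes, in chimes.
LIFETIMES = {"v0": (0, 3), "v1": (0, 1), "v2": (1, 2),
             "v3": (1, 2), "v4": (2, 3), "v5": (3, 4)}

colors = color_intervals(LIFETIMES)
print("colors used:", len(set(colors.values())), "max overlap:", max_overlap(LIFETIMES))
```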

When drawn as an interval graph, the lifetime of the i-th value [...]
[...] and as an interval graph. Not all graphs can [...]
graph:

An active interference graph, on the other hand, is an intersection graph:

Given a collection of sets S_1, ..., S_n of intervals on the real line, an intersection
[...]

when the i-th value is actually active. Figure 6.4 shows an active interference graph drawn
[...] structure because any arbitrary
graph can be described as an intersection graph:

Figure 6.4: Alternative Representations for Live and Active Interference Graphs

This figure shows how the live and active interference graphs in Figure 6.3 can be drawn as
a collection of time intervals on the real line. These alternative representations [...]
information ignored by the traditional drawing of a graph. A line in the graphs illustrated here [...]
different component of a graph; a line in the traditional drawing represents an edge,
whereas one in an interval or intersection graph represents a vertex. To draw an active interference
graph as an intersection graph, I use a dashed interval to represent when a [...]
interval to represent when a read occurs. Edges in interval and intersection graphs are not explicitly
[...] but are instead implicitly represented by overlapping intervals. In the above figure, I
ignore the effects of operational latencies for the sake of simplicity and assume that values active in
the  same  chime  are  active  at  exactly  the  same  times. 
Any finite graph can be considered the intersection graph of a collection of sets:
an edge is an element in the sets and the set associated with a vertex contains
the edges incident to that vertex [82].

For  example,  the  following  figures  represent  three  different  forms  of  the  same  graph  where 
a  number  labels  an  edge  and  a  set  labels  a  vertex: 


The  figure  on  the  left  represents  the  traditional  way  a  graph  is  drawn,  whereas  the  middle 
one  shows  how  a  collection  of  four  sets  represents  the  same  graph  and  the  figure  on  the 
right  represents  the  graph  drawn  as  a  collection  of  overlapping  intervals.  To  put  this  in 
the  context  of  my  assignment  problem,  an  active  interference  graph  is  an  arbitrary  graph. 
Moreover,  any  intersection  graph  corresponds  to  a  sequence  of  loads  and  stores  from  which 
an  active  interference  graph  can  be  built: 

The  leftmost  interval  on  the  real  line  in  each  vertex  Si  of  an  intersection  graph 
corresponds  to  a  load,  while  the  rest  of  the  intervals  in  Si  correspond  to  stores 
that  are  dependent  on  Si's  load.  The  order  in  which  these  loads  and  stores  are 
executed  is  the  same  as  the  order  in  which  the  intervals  from  all  the  vertices 
appear  from  left  to  right  on  the  real  line. 

Hence,  coloring  an  active  interference  graph  is  NP-hard. 
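The set construction quoted above (each vertex is labeled with the set of its incident edges) can be checked mechanically. The following Python sketch is illustrative only: the helper names `incident_edge_sets` and `intersection_graph` and the four-vertex example are hypothetical, and the sketch assumes a simple graph with no self-loops or parallel edges.

```python
from itertools import combinations


def incident_edge_sets(vertices, edges):
    """Number the edges 1, 2, ... and give each vertex the set of its incident edge labels."""
    sets = {v: set() for v in vertices}
    for label, (u, w) in enumerate(edges, start=1):
        sets[u].add(label)
        sets[w].add(label)
    return sets


def intersection_graph(sets):
    """Join two vertices by an edge iff their sets share at least one element."""
    return {frozenset((u, w))
            for u, w in combinations(sets, 2)
            if sets[u] & sets[w]}


# Hypothetical example graph (not the one drawn in the dissertation's figure).
vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]

sets = incident_edge_sets(vertices, edges)
recovered = intersection_graph(sets)

# The intersection graph of the sets is exactly the original graph.
assert recovered == {frozenset(e) for e in edges}
```

Two vertex sets intersect only when the vertices share an edge label, which is why the original graph is recovered for any simple graph.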

In addition to revealing the structure of my interference graphs, interval and
intersection graphs also serve as convenient visual aids for developing an assignment algorithm
and debugging its implementation. Although extremely small graphs can be drawn in the
traditional fashion, this becomes rather messy when the number of vertices or edges reaches
even a modest quantity, such as seven, and many of my interference graphs have
substantially more vertices and edges than this; these graphs can be easily drawn as either interval
or intersection graphs on a single page. For example, Figure 6.5 shows the live and active
interference graphs for one of the loops in the CRI workload that has 47 values.

In  this  subsection,  I  emphasized  the  usage  of  intervals  on  the  real  line  to  represent 
my  interference  graphs.  On  the  other  hand,  a  computer  uses  yet  another  representation  of 
a  graph  that  is  described  in  the  previous  subsection.  All  of  these  different  representations 
portray  the  same  graph  with  varying  amounts  of  information  captured  in  the  description.  In 
addition  to  revealing  the  structure  of  these  graphs,  the  conceptual  representations  described 
above  enable  a  human  to  better  understand  this  assignment  problem.  Furthermore,  I  use  the 
special  structure  of  the  interval  graph  to  prove  (in  the  next  section)  that  a  polynomial-time 
algorithm  uses  the  minimum  number  of  registers  possible  to  assign  values  to  registers. 

6.2.3  Assigning  Values  to  Buses  and  Registers. 

In this subsection, I describe the algorithm I developed to use a partitioned register
file.  The  ideal  goal  of  this  algorithm  is  to  use  the  minimum  number  of  buses  and  registers 
Figure 6.5: A Large Example of Live and Active Interference Graphs

This figure demonstrates large live and active interference graphs that are constructed from a
loop in the CRI workload. Each graph contains 47 values. A value in the live interference graph
is represented by a dotted line, whereas a value in the active interference graph is represented by a
collection of short intervals that are horizontally aligned. A dashed interval indicates when a value
is written and a solid interval indicates when it is read. As an indication of the [...]
graphs, the seventh value has at least 33 edges in the active interference graph to represent potential
conflicts in read accesses.
without causing any access conflicts, such as WAR or WAW register dependences, which
add extra chimes to execution time. However, because I demonstrated that coloring an
active interference graph is NP-hard, finding the minimum number of buses and registers
for my assignment problem is also NP-hard.

Nevertheless, as with the scheduling problem described in the previous chapter,
algorithms based on heuristics can minimize bus and register usage although there is no
guarantee that the minimum has been achieved. Although I would ideally like to minimize
both the number of buses and the number of registers, this is not always possible due to
conditions that oppose minimizing each quantity. Moreover, because increasing the number
of buses requires more hardware than increasing the number of registers, higher priority is
given to minimizing the number of buses.

My assignment algorithm actually consists of two algorithms, executed in sequence.
Because minimizing bus usage has higher priority, the first algorithm assigns values to buses,
and then for each assigned bus, the second algorithm assigns values to registers that share a
bus. To bound their execution time, these algorithms do not backtrack, nor do they reassign
buses or registers in an attempt to use fewer resources. The organization of these
algorithms is similar to that of scheduling algorithms in that values are processed in a
particular order, and a bus or register is chosen based on some strategy. What the order
and strategy are depends largely upon what the goal of an algorithm is.

Figure 6.6 gives the algorithm that examines an active interference graph to assign
values to buses so that bus conflicts are avoided. Because the number of registers per bus
is fixed in the hardware, my algorithm also inspects the associated live interference graph
to ensure that no more than that number of registers would be needed per bus to avoid
register conflicts. Its basic strategy is to use buses as often as possible and introduce a new
bus only if all available ones have been used or if too many values that are simultaneously
live have already been assigned to those buses.
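The strategy just described, together with the two conditions listed in Figure 6.6, can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used for this study: the function name `assign_buses`, the adjacency-set representation of the two interference graphs, and the parameter `regs_per_bus` (standing for R/B) are assumptions, and the example graphs are hypothetical.

```python
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]  # adjacency sets, assumed symmetric


def assign_buses(order: List[str], active: Graph, live: Graph,
                 regs_per_bus: int) -> Dict[str, int]:
    """Greedy bus assignment following the two conditions of Figure 6.6.

    `order` lists values by creation time; `regs_per_bus` plays the role of R/B.
    """
    bus_of: Dict[str, int] = {}
    num_buses = 0
    for v in order:
        chosen = None
        for b in range(num_buses):  # try previously assigned buses, oldest first
            # Condition 1: no active-interference neighbor is already on bus b.
            bus_clash = any(bus_of.get(u) == b for u in active[v])
            # Condition 2: at most R/B - 1 live-interference neighbors on bus b.
            live_on_b = sum(1 for u in live[v] if bus_of.get(u) == b)
            if not bus_clash and live_on_b <= regs_per_bus - 1:
                chosen = b
                break
        if chosen is None:  # no existing bus qualifies, so introduce a new one
            chosen = num_buses
            num_buses += 1
        bus_of[v] = chosen
    return bus_of


# Hypothetical four-value example, listed in creation order.
order = ["v0", "v1", "v2", "v3"]
active = {"v0": {"v1"}, "v1": {"v0"}, "v2": set(), "v3": set()}
live = {"v0": {"v1", "v2"}, "v1": {"v0", "v2"},
        "v2": {"v0", "v1", "v3"}, "v3": {"v2"}}

print(assign_buses(order, active, live, regs_per_bus=2))
# {'v0': 0, 'v1': 1, 'v2': 0, 'v3': 0}
print(assign_buses(order, active, live, regs_per_bus=1))
# {'v0': 0, 'v1': 1, 'v2': 2, 'v3': 0}
```

The sketch does not cap the number of buses. A caller would compare the number of buses it introduces against the number B available in hardware to decide whether the assignment is conflict-free for a given configuration.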

Because the goal of this algorithm is to minimize the number of buses that are
assigned, it would seem judicious to first process values that have more potential conflicts
under the rationale that a value with more edges is more likely to be in a part of the
graph that is highly connected. But because this rationale is not always justified, my
algorithm instead assigns values in the order they are created, an order that corresponds to
the execution order of the operations that produce the values. Although it is not possible
to always use the minimum number of buses, assigning values in order of creation time
guarantees that the minimum number of registers per bus will be used, which in turn
increases the chances of using fewer buses. As I will explain in the following paragraphs,
this guarantee is possible because of the special structure of a live interference graph.

Figure 6.7 gives the algorithm that examines a live interference graph to assign
values to registers so that access conflicts among registers are avoided. Algorithms similar
to this one are described in standard compiler texts [2, 51]. In a fashion similar to the
algorithm for bus assignment, the basic strategy for assigning registers is to use registers
as often as possible and to introduce a new register only if all available ones
have been used. An important characteristic of this algorithm is that it uses
the minimum number of registers for any live interference graph and does so in a time that
is polynomially proportional to the number of values in the graph. In contrast, other


Given an execution order of a dependence graph, assign a bus to each value in order
of creation time by choosing the first bus b such that:

1.  no  neighbor  in  the  active  interference  graph  is  assigned  to  bus  b,  and 

2.  no  more  than  R/B  -  1  neighbors  in  the  live  interference  graph  are  assigned  to 
bus  b. 

If  no  such  bus  exists,  choose  a  new  bus  that  has  not  yet  been  assigned. 


Figure  6.6:  An  Algorithm  for  Assigning  Values  to  Buses  in  a  Partitioned  Register  File 

This algorithm assigns buses to values that are produced and used in a particular order for a
dependence graph. The creation time of a value is when a value is produced by an operation and,
hence, corresponds to when that operation begins executing. The goals of this algorithm are to
avoid conflicts due to register and bus accesses where multiple registers share a bus, and, in doing
so, to minimize the number of buses and registers used.

Information in the interference graphs is used to avoid access conflicts. Bus conflicts are
guaranteed not to occur by choosing a bus not already assigned to a value that is used in the same way
at the same time. When values are simultaneously active in the same way is recorded in the active
interference graph. Access conflicts among registers are guaranteed not to occur by choosing a bus
not already assigned to R/B values that are simultaneously live with the value being processed,
where R is the number of registers in hardware and B is the number of buses. Such information is
recorded in the live interference graph.

Minimizing the number of buses and registers is accomplished in separate ways. The number
of buses used is minimized by choosing a new bus only when all previously assigned ones will cause
a bus conflict if chosen. The number of registers used is minimized by processing values in the order
in which their associated operations are executed. Assigning values in this order causes the number
of values simultaneously live with the value being processed to be a true reflection of the minimum
number of registers needed at that point. As explained in the accompanying main text, this true
reflection arises as the result of the special structure of the live interference graph. However, unlike
the algorithm in Figure 6.7, more than the minimum number of registers may be used because higher
priority is given to minimizing the number of buses, which tends to increase the number of registers
used.
Given an execution order of a dependence graph, assign a register to each value in
order of creation time by choosing a register r such that:

1.  register  r  is  among  those  that  have  already  been  assigned  to  at  least  one  of  the 
previously  processed  values,  and 

2.  no  neighbor  in  the  live  interference  graph  is  assigned  to  register  r. 

If  no  such  register  exists,  choose  a  new  register  that  has  not  yet  been  assigned. 


Figure  6.7:  An  Algorithm  for  Assigning  the  Minimum  Number  of  Registers 

This algorithm assigns registers to values that are produced and used in a particular order for
a dependence graph. The goals of this algorithm are to avoid conflicts due to register accesses and,
in doing so, to use the minimum number of registers.

Register conflicts are guaranteed not to occur by choosing a register that is not already assigned
to a simultaneously live value.

The number of registers used is minimized by choosing a new register only when all previously
assigned ones will cause conflicts if chosen. In addition, because values are assigned in the order of
their creation times, the minimum number of registers is always assigned, a fact that is proven in
the accompanying text.

polynomial-time algorithms do not always use the minimum number of colors. Such an
algorithm results from a slight modification of the algorithm in Figure 6.7, where registers
are assigned in order of the number of potential conflicts a value has instead of by its
creation time. To illustrate that one algorithm is optimal and the other is not, Figure 6.8
shows the assignments produced by these algorithms for the same live interference graph.
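The difference between the two orderings can be reproduced with a short Python sketch. It is illustrative only: the lifetimes below are hypothetical and are not the graph of Figure 6.8, lifetimes are assumed to be half-open intervals `[creation, death)`, and picking the lowest-numbered eligible register is just one of the choices the algorithm in Figure 6.7 permits. The names `live_interference` and `assign_registers` are likewise assumptions.

```python
from typing import Dict, List, Set, Tuple

Lifetime = Tuple[int, int]  # assumed half-open: [creation, death)


def live_interference(lifetimes: Dict[str, Lifetime]) -> Dict[str, Set[str]]:
    """Two values interfere when their lifetimes overlap."""
    return {
        v: {u for u, (s2, e2) in lifetimes.items()
            if u != v and s < e2 and s2 < e}
        for v, (s, e) in lifetimes.items()
    }


def assign_registers(order: List[str],
                     graph: Dict[str, Set[str]]) -> Dict[str, int]:
    """Reuse an already-assigned register that no processed neighbor holds;
    introduce a new register only when none qualifies."""
    reg_of: Dict[str, int] = {}
    num_regs = 0
    for v in order:
        taken = {reg_of[u] for u in graph[v] if u in reg_of}
        free = [r for r in range(num_regs) if r not in taken]
        if free:
            reg_of[v] = free[0]  # arbitrary choice: lowest-numbered candidate
        else:
            reg_of[v] = num_regs
            num_regs += 1
    return reg_of


# Hypothetical lifetimes chosen so that the two orders disagree.
lifetimes = {"A": (1, 2), "B": (2, 3), "C": (0, 6), "D": (4, 9), "E": (5, 10),
             "F": (8, 13), "G": (12, 20), "H": (14, 15), "I": (15, 16),
             "J": (16, 17)}
graph = live_interference(lifetimes)

# Order of creation time, as in Figure 6.7.
by_creation = sorted(lifetimes, key=lambda v: lifetimes[v][0])
# Modified order: most potential conflicts first, ties broken by creation time.
by_conflicts = sorted(lifetimes, key=lambda v: (-len(graph[v]), lifetimes[v][0]))

print(len(set(assign_registers(by_creation, graph).values())))   # 3
print(len(set(assign_registers(by_conflicts, graph).values())))  # 4
```

In this example, values C and G have the most potential conflicts and are not simultaneously live, so the modified order places both in the first register. Value F, whose three neighbors then occupy three different registers without all being simultaneously live, is forced into a fourth.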

The following two paragraphs prove why the modified algorithm is not an optimal
one and the algorithm in Figure 6.7 is. These proofs are based on the following condition
for optimality:

If  a  value,  which  is  about  to  be  processed,  is  live  at  the  same  time  as  some  other 
values  that  have  already  been  processed,  then  these  other  values  must  also  be 
simultaneously  live  with  each  other. 

If this condition is always satisfied by an algorithm, then the number of registers used by
such an algorithm equals the maximum number of simultaneously live values, which is the
minimum number of registers needed for an execution order to avoid register conflicts
(as I explained in Section 5.1.1 of Chapter 5, Register Usage and Instruction Scheduling).
This condition, however, is sufficient but not necessary for an algorithm to be optimal.
For example, Ford and Fulkerson describe a different algorithm that does not satisfy this
condition but is nonetheless optimal [43, pages 64-67].
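The lower bound used in this argument, the maximum number of simultaneously live values, can be computed with a single sweep over lifetime endpoints. The Python sketch below is illustrative: the function name `max_simultaneously_live` is hypothetical, and it assumes half-open lifetimes `[creation, death)`, so a register freed at time t can be reused by a value created at time t.

```python
from typing import Iterable, Tuple


def max_simultaneously_live(lifetimes: Iterable[Tuple[int, int]]) -> int:
    """Return the largest number of lifetimes that overlap at any one time."""
    events = []
    for creation, death in lifetimes:
        events.append((creation, +1))  # a value becomes live
        events.append((death, -1))     # a value dies
    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so a death is counted before a creation at the same instant.
    events.sort()
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak


# Hypothetical lifetimes.
print(max_simultaneously_live([(0, 6), (4, 9), (5, 10), (8, 13), (12, 20)]))  # 3
```

Any assignment that avoids register conflicts needs at least this many registers, so an algorithm that never uses more than this number is optimal.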

The modified algorithm does not always satisfy the optimality condition because
it is possible to process a value after assigning registers to values that are simultaneously
Figure 6.8: One Example of Two Algorithms that Assign Values to Registers

This figure demonstrates how two slightly different algorithms that assign values to registers
use different numbers of registers for the same live interference graph. For this example, values are
labeled A through K. The assignment on the left is produced by the algorithm listed in Figure 6.7,
whereas the assignment on the right is produced by the same algorithm but with a minor
modification: the order in which values are assigned is based on the number of potential conflicts a value has
rather than its creation time. The number of potential conflicts a value has is equal to the number
of edges incident upon it.

Not only does the assignment on the left use one fewer register, but it also uses the minimum
number possible; using fewer than three registers for this live interference graph would result in
register conflicts. Moreover, the algorithm that produces the assignment on the left will always
use the minimum number of registers for any live interference graph, whereas the slightly modified
algorithm will not.
live with it but not with each other. [...]
assignment on the [...] not all simultaneously live with each
other. Nevertheless, by the time value F is processed, its three neighbors have been assigned
to different registers as follows:

Because value G and another value, C, have the most number of potential
conflicts but are not simultaneously live, these two values are [...]
first. Although value C could have been assigned a register different than R0,
recall that a goal of these algorithms is to minimize [...]
value C is assigned its register, there is not enough information to know that a
different register could have been used without increasing the overall number of
registers. Without this information, which becomes available after processing
[...] values, and given the algorithm's goal, value C is assigned the register R0.

The next values to be assigned registers are values D, E, and F, which all
[...] number of potential conflicts. I have arbitrarily [...]
and E first, which are assigned to two new registers, R1 and R2, because these
two values are simultaneously live with each other and with value C.

[...] overlaps those of [...] optimality condition is always satisfied when processing a
[...] number of registers needed to avoid any
register [...]

[...] described above highlight the importance of the relative
[...] of the endpoints of lifetimes, and algorithms that do not use this information tend to be less
[...] the algorithm based on the number of potential conflicts is not
optimal. Although a potential conflict provides information about the relative
[...] two lifetimes, it says nothing about the relative positions of the endpoints of these lifetimes.
[...] algorithm that does not use this information is one in which the
lengths of lifetimes [...] the order of assignment [76]. Such an algorithm
will also produce a less than optimal assignment for the live interference graph illustrated
in Figure 6.8.

Conversely, several algorithms that make use of the relative positions of the
endpoints of lifetimes are optimal, such as the algorithm based on the creation times of values;
another is the one described by Ford and Fulkerson. Other examples that make use of
the relative position of the endpoints are variations on the assignment order used for the
algorithm in Figure 6.7. Other sequences that are possible are:

1.  in  the  reverse  order  of  when  values  are  first  produced, 

2.  in  the  order  of  when  values  are  last  read,  or 

3.  in  the  reverse  order  of  when  values  are  last  read. 

Although different orders are used, the altered algorithms still satisfy the optimality
condition and, hence, are optimal.

In  summary,  assigning  values  in  the  order  of  their  creation  time  always  produces 
an  optimal  register  assignment.  Although  this  does  not  always  produce  an  optimal  bus 
assignment,  assigning  values  in  this  order  guarantees  that,  when  a  value  is  processed,  the 
number  of  its  neighbors  in  the  live  interference  graph  assigned  to  a  particular  bus  is  also 
the  minimum  number  of  registers  needed  at  that  point.  This,  in  turn,  increases  the  chances 
of  assigning  fewer  buses.  Using  creation  time  instead  of  number  of  potential  conflicts  or 
lengths of lifetimes has implementation benefits as well because values do not have to be
sorted  into  an  order  different  from  that  of  the  given  execution  order.  Moreover,  because 
values  with  more  potential  conflicts  tend  to  have  longer  lifetimes  and,  hence,  are  executed 
earlier, an algorithm based on creation time still has some of the benefits of one based
on  a  count  of  potential  conflicts.  In  a  later  section,  I  will  present  some  quantitative  data 
showing  that  an  algorithm  based  on  creation  time  uses  slightly  fewer  buses  than  one  based 
on  a  count  of  potential  conflicts. 

I have intentionally given high-level descriptions of my algorithms so that a
programmer has as much freedom as possible in implementing these algorithms without affecting their
stated goals. In particular, the description of the algorithm for assigning registers is
deliberately nonspecific about which register to choose when there are several candidates (see
Figure 6.7). The choice of a register is left as an implementation detail because this
information is not used when proving that the algorithm is optimal in its register usage. As a result,
other reasons can be used to dictate the choice of a register without affecting the number of
registers assigned. On the other hand, because little can be shown mathematically about
bus usage, the algorithm for assigning buses does specify which bus to choose, namely the
first bus that was ever assigned and that satisfies the conditions for avoiding access
conflicts (see Figure 6.6). Because the data I produced for showing how many buses are used
is based on this algorithm, altering the choice of bus may affect how many buses are used.
If a different rule for choosing a bus is preferred for other reasons, further experimentation
is needed to determine the impact of bus choice on bus usage.
6.3 Experimental Framework

In this section, I briefly describe the performance metric and the methodology I
use to carry out the studies in this chapter. Other aspects of the experimental framework,
such as the architectural platform, the performance tools, and the workload are described
in Chapter 4, Common Experimental Framework.

6.3.1  Performance  Metric 

For this study I should ideally have compared the execution times of the CRI
workload using register files with various degrees of partitioning. But because the Y-MP
simulator does not model a partitioned register file, and because I did not have access to the
source code for the simulator, I compare instead the number of conflict-free assignments
my algorithm produces for the CRI workload using various configurations of vector register
files. Unlike execution time, a count of conflict-free assignments cannot indicate how much
worse an assignment with conflicts performs relative to one that is conflict-free. Because
this performance metric is less informative than execution time, decisions based on counts
of conflict-free assignments are more conservative than ones based on execution time.

Nonetheless, a count of conflict-free assignments is a suitable alternative for a
performance metric. This is because an execution order for a dependence graph will be
executed in the same amount of time using a partitioned register file or a traditional one
as long as an assignment for that configuration is conflict-free. Furthermore, this metric
can be collected at compilation time, unlike execution time, which requires executing the
generated code on the Y-MP simulator. As a result, a count of conflict-free assignments
provides a quick method for evaluating the effectiveness of the algorithm and any of its
variants.


6.3.2  Methodology 

To determine the minimum number of buses needed to effectively use the functional
units of the Cray Y-MP vector processor, I use the heuristic algorithm presented in the
previous section and compare the performance of its assignments for different numbers
of registers and buses. Such a comparison cannot definitively give the minimum number
of buses needed because this assignment problem is NP-hard; finding the optimal answer
for such a problem is practically infeasible for the same reasons that finding the optimal
answer for the scheduling problem, another NP-hard problem, is practically infeasible (see
Section 5.2.2 in Chapter 5, Register Usage and Instruction Scheduling). In addition,
a heuristic algorithm has the advantage that it will increase compilation time minimally
when it is used in a compiler.


6.4  How  Many  Buses? 

In  a  previous  section,  I  explained  how  to  assign  values  to  a  partitioned  register  file 
by  using  an  algorithm  that  attempts  to  minimize  the  number  of  buses  and  registers  assigned. 
In  this  section,  I  present  data  to  evaluate  how  well  a  partitioned  register  file  can  be  used 
Figure 6.9: Data for Evaluating the Usability of a Partitioned Vector Register File

This table lists the percentage of dependence graphs for which my algorithm [...]
assignment with no conflicts for the indicated number of buses and registers. For example, [...]
of the dependence graphs in the CRI workload, my algorithm uses 8 buses or less, and 16 registers
to produce a conflict-free assignment. Because one of the goals [...]
whether the number of buses can remain at 8, which is the number of buses the Cray Y-MP currently
implements, I have highlighted the data for 8 buses.


and  to  determine  how  many  buses  and  registers  provide  an  acceptable  cost/performance 
configuration  for  the  Cray  Y-MP  vector  processor. 


6.4.1  Performance  Evaluation  of  Algorithm 
and  Partitioned  Register  Files 

Figure 6.9 tabulates the fraction of dependence graphs that have a conflict-free
assignment using partitioned register files that have anywhere from 4 to 128 registers and
from 4 to 32 buses. This data shows that the count of conflict-free assignments does not
change significantly among the various configurations of vector register files with 16 or
more registers. From this observation, I conclude that a partitioned register file performs
comparably to a traditional vector register file with the same number of registers as long as
[...] enough that, with only 4 buses, there are many dependence graphs
that do not have a conflict-free assignment, even when an abundance of registers is provided.
Hence, 4 buses are clearly not enough to effectively use the functional units of the Cray
Y-MP processor. Although a partitioned register file with 4 buses is attractive from a cost
viewpoint, such a configuration is never an acceptable choice because of its
poor performance. Moreover, because the coverage is so poor, this data also suggests that
[...] a partitioned register file with 4 buses is an inadequate design for a vector
processor with a configuration of functional units comparable to [...]
the partitioned register files for the Ardent Titan and the Fujitsu VPU, which both use
4 buses, may be ineffective for their respective configuration of functional units, although
further investigation is needed to verify this hypothesis.

On the other hand, doubling the number of buses from 4 to 8 results in a large
increase in the number of conflict-free assignments, and most of the dependence graphs have
a conflict-free assignment when 16 or more registers are provided. Moreover, because there
is little improvement when the number of buses is doubled from 8 to 16, this data shows
that 8 buses are enough to effectively use the functional units of the Cray Y-MP processor.

Although no assignment algorithm can guarantee to use the minimum number of
buses, I stated in a previous section that assigning values in the order of their creation time
instead of by the number of their potential conflicts will tend to use fewer buses. This is
because assigning values by creation time guarantees that at least the minimum number
of registers per bus will be used, which in turn increases the chances of using fewer buses.
Quantitative evidence of this is given in Figure 6.10, which indicates the number of buses
assigned by these two algorithms for a partitioned register file with 8 buses and 1, 2, or
4 registers sharing a bus. Regardless of the amount of partitioning, assigning on the basis
of creation time uses fewer than 8 buses for a greater fraction of the dependence graphs
in the CRI workload. Although this is not a crucial difference (both algorithms produce
a comparable number of conflict-free assignments for 8 buses), assigning values by creation
time is nonetheless the better choice not only because it uses slightly fewer buses, but also
because it is easier to implement.

6.4.2  Making  a  Stronger  Case  for  8  Buses  and  16  Registers 

From the data in Figure 6.9, it is clear that 8 buses are enough, but the
appropriate number of registers (16 or 32) is less obvious. Because these numbers do not
reflect execution time, there is no way of knowing whether increasing the number of
registers from 16 to 32 would result in a significant improvement in performance. A stronger
case for 16 registers and 8 buses, however, could be made by increasing the number of
conflict-free assignments for this configuration. To do this in a systematic fashion, in this
subsection I examine in greater detail the seven dependence graphs for which my algorithm
generates an assignment with conflicts for any clues that would result in more conflict-free
assignments. Once I find a method that produces a conflict-free assignment for one of these
dependence graphs, I use that method on the rest of the dependence graphs to see whether
more conflict-free assignments are produced overall.

For two of the seven dependence graphs, my assignment algorithm uses more than
8 buses for different reasons. For one of these dependence graphs, more than 8 buses are
assigned because my algorithm doesn't happen to produce the best assignment, emphasizing
the heuristic nature of this algorithm. But assigning values to buses in the order of their
final reads or death times, rather than their creation times, does produce a conflict-free
assignment for 8 buses and 16 registers. Using death time instead of creation time as the
basis for assignment for all the graphs in the CRI workload, however, does not change the
overall number of conflict-free assignments when using 8 buses and 16 registers. This is
because both these algorithms are based on heuristics and neither will always produce the
best assignment; although assigning values by death times produces a better assignment for
some dependence graphs, assigning values by creation times produces a better assignment
for other dependence graphs.

More than 8 buses are assigned for the second dependence graph because, as
I discovered upon closer examination, its associated active interference graph contains a
complete graph with 9 vertices. Although no assignment algorithm can use fewer than
9 buses for this interference graph, interference graphs correspond to a particular execution
order of a dependence graph, and a different execution order allows my algorithm to produce
a conflict-free assignment for this dependence graph using 8 buses. This different execution
order is generated by a slight modification to the scheduling algorithm, in which operations
are processed in their statement order rather than in the order of their path distance.
Although changing the scheduling algorithm may affect the execution time of the loops,
this is only relevant if the change increases the overall number of conflict-free assignments
when using 8 buses and 16 registers. As with the previous attempt, the heuristic nature of
the scheduling and assignment algorithms prevents this method from always producing the
best results. In fact, applying this change to all the dependence graphs reduces the overall
number of conflict-free assignments by one, thereby suggesting poorer performance for the
entire CRI workload.

Figure 6.10: Comparing the Effectiveness of Two Assignment Algorithms

Each point on a curve indicates that an assignment using a particular algorithm uses b or
fewer buses for p percent of the dependence graphs. The middle two curves are the distributions
when assigning 8 buses and 16 registers; another pair of distributions is for 8 buses and 32
registers. These distributions show that, regardless of the amount of partitioning, assigning
values by creation time uses fewer than 8 buses for the largest number of dependence graphs in
the CRI workload.

Although my algorithm uses more than 8 buses for two dependence graphs, it
generates assignments with conflicts for the other five dependence graphs because their minimal
register requirement is greater than 16. For these five dependence graphs, a conflict-free
assignment for 8 buses and 16 registers is possible only if their minimal register
requirement is reduced. Because this requirement is associated with a particular execution order,
one method of reducing it is to use a different scheduling algorithm to generate different
execution orders for the same set of dependence graphs. Using the modified scheduling
algorithm described in the previous paragraph thus reduces the minimal register requirement
to less than 16 for two of the five dependence graphs. Unfortunately, as I mentioned in the
previous paragraph, the total number of conflict-free assignments decreases by one. Using
execution orders generated from other variations of the scheduling algorithm also produces
comparable results. Hence this alternative is not an improvement.
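The dependence of the minimal register requirement on the execution order can be illustrated
with a short Python sketch. It assumes that each value is described only by a hypothetical pair
of inclusive creation and death times taken from one execution order, and it computes the
requirement as the largest number of values that are live at the same time. The function name
and the sample lifetimes are illustrative only and are not taken from the CRI workload.

```python
def minimal_register_requirement(lifetimes):
    """Largest number of simultaneously live values.

    lifetimes: list of (creation_time, death_time) pairs, both inclusive,
    so an operand read at time t and a result created at time t overlap.
    """
    events = []
    for creation, death in lifetimes:
        events.append((creation, 1))    # value becomes live
        events.append((death + 1, -1))  # value is no longer live after its final read
    live = peak = 0
    # At equal times the -1 events sort first, so releases precede new values.
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak


# Two hypothetical execution orders of the same four values.
order_a = [(0, 3), (1, 3), (2, 3), (3, 4)]  # three operands held until time 3
order_b = [(0, 1), (1, 2), (2, 3), (3, 4)]  # each value consumed right away
print(minimal_register_requirement(order_a), minimal_register_requirement(order_b))
```

Under these assumptions the first order needs four registers and the second only two, which is
why a different scheduling algorithm can change whether 16 registers suffice.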

Another approach to reducing the minimal register requirement is to allow register
assignments of the form Vi<-Vi op Vj or Vi<-Vj op Vi, where two different values, an
operand and a result, are assigned to the same register Vi. For the sake of brevity, from this
point on I use Vi<-Vi op Vj to represent both forms of this type of assignment. Although
the hardware allows such assignments, my modeling of assignment does not because all
values associated with an operation are considered to be simultaneously live and hence
are assigned to different registers. Allowing simultaneously live values to share a register
could reduce the minimal register requirement and perhaps increase the number of
conflict-free assignments when using 8 buses and 16 registers. Although including a feature that
recognizes when two simultaneously live values can actually share a register complicates
the algorithm somewhat, this additional complexity is justified nonetheless if substantially
fewer registers and buses can be used as a result.

An assignment of the form Vi<-Vi op Vj should only be used when there is no
possibility of causing a register or bus conflict. Because a value is produced by an operation,
such as a load or multiply, no conflicts will occur between two simultaneously live values,
y and z, only if their associated operations, op_y and op_z, satisfy the conditions listed in
Figure 6.11. The relative positions of two such values in the live and active interference
graphs look like this:
CONDITION                                    REASON FOR CONDITION
-------------------------------------------  -------------------------------------
1. operation op_z is dependent on            minimum requirement for using
   operation op_y                            Vi<-Vi op Vj
2. op_z is the final read of the value y     to avoid register conflicts
3. op_z is not executed in the same          to avoid bus conflicts due to writes
   chime as op_y
4. any operation reading value z does        to avoid bus conflicts due to reads
   not execute in the same chime as op_z

Figure 6.11: Conditions for Using Vi<-Vi op Vj

This table lists the conditions that identify two simultaneously live values, y and z, that can
be used in assignments of the form Vi<-Vi op Vj or Vi<-Vj op Vi without causing any register
or bus conflicts. Because a value is produced by an operation, such as a load or multiply, these
conditions actually apply to the values' associated operations, op_y and op_z.


The important aspects of this diagram are that the last read of the value y (indicated by
a solid interval) occurs simultaneously with the write of the value z (indicated by a dashed
interval), and all other accesses of either value occur at other times. Although this diagram
illustrates the last three conditions in Figure 6.11, it does not indicate that operation op_z is
dependent on operation op_y, information that is kept in the associated dependence graph.

If  the  conditions  above  are  satisfied,  the  value  z  can  be  assigned  to  the  same  register 
assigned  to  the  value  y  without  causing  any  register  or  bus  conflicts,  and  the  instructions 
for  their  associated  operations  look  like  this: 


    Vi <- ... op_y ...

    Vi <- Vi op_z ...

Two values that satisfy the above conditions can be merged in the associated interference
graphs and treated as a single value for the purposes of assignment. Thus, to incorporate
assignments of the form Vi<-Vi op Vj into my algorithm, the interference graphs, after
being built, are modified by combining values that satisfy the above conditions. The modified
graphs are then used as before to assign buses and registers.
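The four conditions of Figure 6.11 and the subsequent merging step can be sketched in Python
as follows. This is only an illustration under stated assumptions, not the implementation used
for the experiments: the Value record, its fields, and both function names are hypothetical,
dependence is simplified to "op_z reads y directly", and "final read" is taken to mean that no
read of y occurs in a later chime.

```python
from dataclasses import dataclass, field


@dataclass
class Value:
    """Hypothetical record of one value in a scheduled dependence graph."""
    name: str
    write_chime: int                                 # chime in which its operation executes
    read_chimes: list = field(default_factory=list)  # chimes of the operations reading it
    operands: tuple = ()                             # names of values read by its operation


def can_share_register(y, z):
    """Check the four conditions of Figure 6.11 for values y and z."""
    depends = y.name in z.operands                        # 1. op_z depends on op_y
    final_read = (bool(y.read_chimes)
                  and max(y.read_chimes) == z.write_chime)  # 2. op_z is the final read of y
    separate_writes = z.write_chime != y.write_chime      # 3. op_z and op_y in different chimes
    no_chained_read = z.write_chime not in z.read_chimes  # 4. no reader of z in op_z's chime
    return depends and final_read and separate_writes and no_chained_read


def merge(y, z):
    """Combine y and z into a single value for the purposes of assignment."""
    return Value(y.name + "+" + z.name, y.write_chime,
                 y.read_chimes + z.read_chimes, y.operands)


y = Value("y", write_chime=0, read_chimes=[1, 3])
z = Value("z", write_chime=3, read_chimes=[5], operands=("y", "w"))
if can_share_register(y, z):
    merged = merge(y, z)
    print(merged.name, merged.write_chime, max(merged.read_chimes))
```

The merged value occupies its register from the write of y until the last read of z, which
makes its lifetime longer than that of either original value.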

Although combining live values decreases the number that are simultaneously live
in the interference graph, this merging also has the negative effect of increasing the
connectivity in the active interference graph. The data in Figure 6.12 shows that the reduction
in the minimal register requirement is insufficient grounds for countering the impact of
this increased connectivity. The first set of data in graph (a) shows that such an
assignment can be used fairly often, especially for the larger dependence graphs. For
example, from 10 to 20 values can be merged for nine dependence graphs and 102 values for
one dependence graph. Although this first set of data looks promising, the second set in
graph (b) shows that the resultant reduction in the number of registers is not substantial.
At best, the minimal register requirement is reduced by 3 for two dependence graphs; most
others show no change or at most a reduction of 1 register.

Finally, the third set of data in table (c) shows that using Vi<-Vi op Vj actually
decreases the number of conflict-free assignments when 8 buses and 16 or more registers
are used, as well as when 4 buses and 8 or more registers are used. This is because
merged values result in longer lifetimes, which in turn reduce the number of registers that
become available for re-assignment at any given time. Furthermore, allowing the re-use of
a register in this fashion is never better than always using three distinct registers for the
values that are read and written by an operation. Although there are other reasons for using
Vi<-Vi op Vj (for example, in the vector version of a scalar reduction where the operand
and result actually represent the same value), I conclude from this set of data that using
such an assignment improves neither register nor bus usage, and hence there is little
reason to incorporate such a capability into the assignment algorithm.

A third approach to reducing the minimal register requirement is to spill registers.
In other words, rather than combining values in the interference graphs as was done in the
preceding method, a value is split into two when registers are spilled to reduce the minimal
register requirement. However, a judicious choice of which values to spill and when to do
so is necessary if this technique is to work. Although the minimal register requirement can
be reduced by choosing candidates from the largest set of values that are live at the same
time, the actual act of spilling a value requires using a bus, and hence adds more edges to
the active interference graph. In order to produce a conflict-free assignment, the reduction
in register conflicts must be more than the potential increase in bus conflicts.

Data from the previous chapter suggests that spilling registers is a promising
approach. Even though five of the dependence graphs have a minimal register requirement
greater than 16, I showed in the previous chapter that with a traditional vector register file
16 are enough to improve performance and that adding more registers results in only a
nominal improvement in performance. This is because the algorithm for register assignment in
the cft77 compiler generates code for register spilling to accommodate the requirements of
the larger dependence graphs. Moreover, because a vector architecture supports fine-grain
parallelism, these extra instructions can execute in parallel with the original instructions
without increasing execution time.

Incorporating register spills into my algorithm involves changing the dependence
graph to show which values are to be spilled and when the extra spill operations are to
be executed. However, this entails extensive changes because more information must be
integrated than what was done for the other changes I have described. For example, using
a different order for scheduling or assigning is an easy change because it does not require
examining the interaction among sets of values. A more complicated change is allowing
assignments of the form Vi<-Vi op Vj, which requires examining the interaction between
pairs of dependent values before the interference graphs can be suitably modified.
Introducing register spills is even more complex because it requires examining the interaction
among several values that are live or active at the same time in order to choose candidates
for spilling and to modify the dependence graph appropriately. Because of such extensive
changes and because, as I argue in the next subsection, using 32 registers with 8 buses is a
reasonable choice for now, I omit a quantitative evaluation of this method, leaving such an
undertaking for the future.

Figure 6.12: Impact on Assignments When Using Vi<-Vi op Vj

The figures above show three sets of data for evaluating the effectiveness of allowing assignments
of the form Vi<-Vi op Vj, whereby simultaneously live values that have specific characteristics
can be assigned to the same register without causing any conflicts.

Because two such values are treated as a single value for the purposes of assignment, the first
histogram, graph (a), indicates how often such an assignment can be used in a dependence graph.
Allowing simultaneously live values to share a register could reduce the minimal register
requirement, and the second histogram, graph (b), shows the size of this reduction. The third set
of data, table (c), gives the fraction of dependence graphs for which my algorithm produces a
conflict-free assignment when using Vi<-Vi op Vj for the indicated number of buses and registers.

The first set of data shows that assignments of the form Vi<-Vi op Vj can be used fairly often.
Despite this promising result, the second set of data shows that the minimal register requirement
for a dependence graph is not reduced substantially. Finally, the third set of data shows that
using Vi<-Vi op Vj does no better than not using this type of assignment, and actually decreases
the number of conflict-free assignments when more than one register shares a bus in a partitioned
register file with 4 or 8 buses (indicated by the percentages in italics).

6.4.3  Choosing  a  Partitioned  Register  File 

In  summary,  the  experiments  in  the  previous  subsection  show  that  increasing  the 
number  of  conflict-free  assignments  for  8  buses  and  16  registers  in  a  systematic  fashion  is  a 
difficult  task.  Although  the  data  in  Figure  6.9  is  unable  to  provide  a  strong  case  for  either 
16  registers  or  32  registers  when  using  8  buses,  the  relative  priorities  of  performance  and 
cost  also  influence  the  choice  of  which  configuration  to  implement. 

If  cost  is  more  important  than  performance,  then  this  data  suggests  that  a  configu¬ 
ration  of  8  buses  and  16  registers  is  the  better  choice.  Because  different  heuristics  produce  a 
conflict-free  assignment  for  different  dependence  graphs,  performance  could  be  improved  by 
sequentially  applying  these  techniques  until  a  conflict-free  assignment  is  found.  The  draw¬ 
back  to  this  approach  is  a  potential  increase  in  compilation  time.  However,  experimental 
evidence  shows  that  over  90%  of  the  cases  would  need  to  use  only  the  original  algorithm,  and 
doing  so  should  not  significantly  increase  the  average  compilation  time.  Another  alternative 
is to incorporate register spilling into the assignment algorithm, although further investigation
is needed to evaluate the effectiveness of this technique.
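The idea of sequentially applying different heuristics until a conflict-free assignment is found
can be sketched as a simple fallback loop. The following Python fragment is only an illustration:
the function name, the representation of an assignment, and the conflict test are all hypothetical
placeholders, and the heuristics are assumed to be supplied in order of preference with the
original algorithm first.

```python
def assign_with_fallback(graph, heuristics, is_conflict_free):
    """Try each heuristic in turn; stop at the first conflict-free assignment.

    Returns the chosen assignment and the number of heuristics that were run.
    If none succeeds, the result of the first heuristic is returned.
    """
    first = None
    attempts = 0
    for heuristic in heuristics:
        attempts += 1
        assignment = heuristic(graph)
        if first is None:
            first = assignment
        if is_conflict_free(assignment):
            return assignment, attempts
    return first, attempts


# Stand-in heuristics that only report how many buses they would use.
by_creation_time = lambda graph: {"buses": 9}
by_death_time = lambda graph: {"buses": 8}
fits = lambda assignment: assignment["buses"] <= 8

print(assign_with_fallback("loop body", [by_creation_time, by_death_time], fits))
```

Because additional heuristics run only when an earlier one fails, the extra compilation time is
paid only for the minority of dependence graphs that the first heuristic cannot handle.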

On  the  other  hand,  if  performance  is  more  important  than  cost,  then  the  better 
choice  is  a  configuration  of  8  buses  and  32  registers.  Based  on  the  cost/performance  analysis 
in  Figure  6.2  (on  page  115),  however,  this  configuration  is  not  an  ideal  tradeoff  between 
cost  and  performance.  In  fact,  a  partitioned  register  file  with  8  buses  and  32  registers  is 
slightly more costly to implement than a traditional one with 16 buses and 16 registers.
Nonetheless,  three  factors  favor  the  partitioned  organization. 

First,  using  32  registers  results  in  a  greater  improvement  in  performance  than 
when  using  16  registers  (18%  versus  9%).  Moreover,  a  partitioned  register  file  with  8  buses 
and  32  registers  has  slightly  more  conflict-free  assignments  than  a  traditional  register  file 
with  16  buses  and  16  registers. 

Second,  a  register  file  that  provides  more  registers  rather  than  more  buses  is  better 
able  to  accommodate  a  wider  range  of  dependence  graphs.  Because  the  number  of  functional 
units  provided  by  hardware  limits  the  number  of  values  that  can  be  simultaneously  active, 
minimal  bus  requirements  are  influenced  more  by  the  configuration  of  functional  units  and 
less  by  a  program’s  characteristics.  In  other  words,  a  dependence  graph  with  even  an 
inordinate  number  of  operations  is  unlikely  to  require  more  than  8  buses  because  the  number 
of  simultaneously  active  values  is  limited  by  the  number  of  vectorizable  operations  that  can 
execute  in  parallel,  which  in  turn  is  limited  by  the  number  of  functional  units.  Based  on  the 
data in Figure 6.9, 8 buses appear to be enough for the current configuration of functional
units in the Cray Y-MP.

In contrast to bus requirements, the need for a register depends on how operations
interact with each other. This means that minimal register requirements are influenced less
by what hardware provides and more by the dependence patterns in a program. Providing
32 registers instead of 16, but at the same hardware cost, allows a partitioned vector register
file to more easily accommodate larger dependence graphs which, although unlikely to
require more than 8 buses, are likely to require more than 16 registers.

Figure 6.13: Cost Analysis of Vector Register Files with Varying Vector Lengths

This table gives the relative difference in chip count of some traditional and partitioned vector
register files when varying the number of elements per vector register, which is also known as the
vector length. This cost analysis is an extension of the one given in Figure 6.2, which assumes a
constant vector length of 64. No increase in chip count indicates that the cost of implementing
that number of registers and buses is the same as the current implementation of a Cray Y-MP
processor, which uses a total of 512 registers and 8 buses.

The third factor that favors the partitioned organization is that the cost of
implementing 8 buses and 32 registers can be lessened by reducing the vector length. Whereas the
cost/performance analysis in Figure 6.2 assumes that the vector length is fixed
at 64, Figure 6.13 shows how the increase in cost is affected by the vector length. When
the vector length is 32, a partitioned register file with 8 buses and 32 registers is, in fact,
less costly to implement than a traditional one with 16 buses and 16 registers. A further
reduction in vector length to 16 makes the partitioned organization even more attractive.
However, as explained at the beginning of this chapter, sustainable performance is
adversely affected if the vector length is shortened too much. As a result, although this
analysis looks promising, further studies are needed to measure how shorter vector lengths
affect performance.

6.5  Related  Work 

I presented one algorithm for assigning registers and one for assigning buses,
whereas most researchers emphasize only register assignment. Eisenbeis, Jalby, and
Lichnewsky, as well as Mangione-Smith, Abraham, and Davidson,
have both presented assignment algorithms for vector registers [33, 81]. Although these
two algorithms are optimal in that they always assign the fewest number of registers
for a given execution order, a polycyclic vector scheduler produces the execution order. In
contrast, my assignment algorithm, which is also optimal, uses an execution order produced
by a simple vector scheduler. Both the assignment and scheduling algorithms I use are less
complex than those developed for these other studies.

Although  my  algorithm  is  used  for  a  vector  architecture  in  this  chapter,  it  can  also 
be  used  for  other  architectures.  The  underlying  architecture  affects  when  values  are  live 
and  active,  information  that  is  used  to  construct  the  interference  graphs.  Once  constructed, 
however,  the  assignment  of  values  to  registers  can  be  done  using  the  algorithm  I  described. 
In  general,  assignment  algorithms  for  one  type  of  architecture  can  often  be  adapted  to  other 
architectures. In contrast, algorithms for different types of program fragments cannot be
transformed  as  easily  and  as  a  result,  algorithms  presented  in  the  literature  can  be  catego¬ 
rized  by  the  type  of  program  fragment  they  operate  on  [96].  In  the  following  paragraph,  I 
present  this  categorization  of  assignment  algorithms  to  contrast  my  algorithm  with  others. 

One  category  of  algorithms  operates  on  a  single  expression  at  a  time,  where  the 
expression  is  represented  by  a  tree,  which  is  a  dependence  graph  with  no  common  subex¬ 
pressions  [87,  98,  99,  3].  Another  class  of  algorithms,  which  are  known  as  global  register 
assigners,  operates  on  the  entire  program,  which  is  represented  by  a  control  flow  graph 
whose vertices are basic blocks [9, 18]. The third category of algorithms, which are known
as  local  register  assigners,  operates  on  a  single  basic  block  at  a  time.  This  type  of  program 
fragment,  which  can  be  represented  by  a  dependence  graph  with  common  subexpressions, 
falls  between  the  types  for  the  other  two  classes  of  algorithms.  Algorithms  for  local  register 
assignment  can  be  further  classified  into  two  sub-categories  based  on  whether  the  execution 
order  is  fixed  [100]  or  not.  The  algorithm  I  presented,  as  well  as  the  two  assignment  algo¬ 
rithms  for  polycyclic  vector  schedules  mentioned  above,  fall  into  the  second  sub-category. 
In  this  dissertation,  for  the  sake  of  brevity,  I  have  used  the  terms  local  register  assignment 
and  register  assignment  to  refer  to  this  sub-category  rather  than  the  whole  category. 

Because  part  of  my  investigation  was  to  determine  the  minimum  numbers  of  reg¬ 
isters  to  implement  in  hardware,  I  designed  my  algorithm  to  use  the  fewest  registers  for  a 
given  execution  order.  For  practical  register  assignment,  however,  the  number  of  registers 
in  hardware  is  fixed  and  when  the  minimal  register  requirement  of  an  execution  order  ex¬ 
ceeds  that  number,  extra  memory  references  must  be  generated  to  spill  registers.  Hence,  in 
addition  to  assigning  values  to  registers,  an  algorithm  for  a  production  compiler  must  also 
choose  an  appropriate  register  to  spill  so  as  to  minimize  the  number  of  extra  memory  refer¬ 
ences.  Not  surprisingly,  most  algorithms  for  local  register  assignment  handle  the  problem 
of  register  spilling.  Although  my  algorithm  for  assigning  registers  does  not  directly  address 
this  problem,  it  can  be  easily  extended  to  do  so.  I  discuss  the  merits  of  doing  this  when  I 
describe future studies in Section 7.2 of Chapter 7, Concluding Remarks.

As part of my presentation, I gave a proof that my algorithm always assigns
the  minimum  number  of  registers.  Researchers  rarely  examine  this  aspect  in  much  detail, 
focusing  instead  on  minimizing  the  number  of  extra  memory  references.  An  exception  is 
Freiburghouse,  who  describes  an  optimal  algorithm  that  differs  from  mine  in  two  ways 
[44].  First,  rather  than  constructing  a  live  interference  graph,  Freiburghouse  computes 
usage  counts,  which  are  the  number  of  times  each  value  is  referenced.  The  second  and 
more  important  difference  is  how  the  optimality  of  the  algorithm  is  proven.  Although 
Freiburghouse did not give a proof, he refers to Gries, who independently developed this
same algorithm to allocate temporary variables to memory locations on the stack rather
than to registers [51, pages 299-304]. The optimality of this algorithm, which is actually
attributed to Dantzig and Reynolds [27], relies on a stack to keep track of available registers
and does not allow any choice for a register.

In contrast, the proof showing the optimality of my algorithm emphasizes the
importance of assigning values in order of their creation times and does not rely on which
register to choose when there are several candidates. Because of the similarities between
my algorithm and Freiburghouse's, this proof can also be used to prove the optimality of
his algorithm, hence removing the necessity for a stack to keep track of available registers.
A consequence of this proof is that a register can be chosen for other reasons. For example,
this extra degree of freedom allows available registers to be assigned using a first-in,
first-out queue rather than a last-in, first-out stack without affecting the optimality of either
algorithm. Assigning registers in such a round-robin fashion increases the probability that
the same register will be assigned to values whose lifetimes are temporally far apart. This
in turn makes the register assignment less sensitive in terms of performance to mismatches
between what the compiler thinks the hardware does and how the hardware actually
behaves.
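The freedom to choose among available registers can be illustrated with a small Python sketch of
assignment in order of creation time. It is not the algorithm as presented in this chapter: it
assumes that each value is given only as a hypothetical pair of inclusive creation and death
times, it ignores buses entirely, and the function name and the policy parameter are illustrative.
The pool of free registers is managed either as a first-in, first-out queue or as a last-in,
first-out stack.

```python
import heapq
from collections import deque


def assign_registers(lifetimes, policy="fifo"):
    """Assign a register number to each value, visiting values by creation time.

    lifetimes: list of (creation_time, death_time) pairs, both inclusive.
    policy: "fifo" reuses the register freed longest ago, "lifo" the most recent.
    """
    order = sorted(range(len(lifetimes)), key=lambda v: lifetimes[v][0])
    free = deque()        # registers released so far and not yet reused
    busy = []             # min-heap of (death_time, register)
    assignment = [None] * len(lifetimes)
    registers_used = 0
    for v in order:
        creation, death = lifetimes[v]
        # Release every register whose value died strictly before this creation.
        while busy and busy[0][0] < creation:
            free.append(heapq.heappop(busy)[1])
        if free:
            reg = free.popleft() if policy == "fifo" else free.pop()
        else:
            reg = registers_used   # a new register is opened only when none is free
            registers_used += 1
        assignment[v] = reg
        heapq.heappush(busy, (death, reg))
    return assignment


lifetimes = [(0, 1), (0, 2), (3, 4), (5, 6), (7, 8)]
print(assign_registers(lifetimes, "fifo"))
print(assign_registers(lifetimes, "lifo"))
```

For these sample lifetimes both policies use two registers, but the queue alternates between
them while the stack keeps reusing the register that was freed most recently, so values placed
in the same register are further apart in time under the queue.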

As part of my algorithm, I modeled the assignment of registers as a graph coloring
problem. This technique has been applied to global register assignment by Ershov et al.
[35, 36], who use the term incompatibility graph instead of interference graph, and by
Chaitin et al. [18], who coined the term interference graph. Assigning registers globally and
locally are two different problems. The interference graph for global register assignment
can be described as an intersection graph [78], which is an arbitrary graph, whereas the one
for local register assignment is an interval graph, which has a special structure. Although
other researchers have also recognized the fact that an interval graph arises when assigning
registers locally [11], I am unable to find a reference that specifically describes the use of
graph coloring for local register assignment.

Unlike the other applications of graph coloring for register assignment, I use not
one but two interference graphs at the same time. These two graphs arose from the special
organization of the partitioned vector register file. The live interference graph I use is similar
to the interference graph constructed for global register assignment in that liveness is the
basis for interference. However, the range of a live value for global register assignment is
slightly different because a value's life can span multiple basic blocks, thus producing an
intersection graph rather than an interval graph. The active interference graph I use is a
new construction.

The inspiration for the development of this chapter's assignment algorithm is, of
course, the partitioned vector register file. In addition to reducing the cost of implementing
more vector registers, this configuration is also an inexpensive method for implementing
a reconfigurable vector register file, as is the case in the Ardent Titan and the single-chip
vector processor by Fujitsu [31, 64]. Although Fujitsu's FACOM vector processors also
provide a reconfigurable register file, descriptions of its implementation do not indicate
whether or not it is a partitioned vector register file [85]. Nonetheless, despite the existence
of these commercial implementations, I do not know of any publication that describes the
details of an algorithm that can effectively assign values to a partitioned register file.

In addition to developing the assignment algorithm, I also examined the
performance impact of partitioned register files and concluded that such configurations are not
necessarily an impediment to performance. There are relatively few studies that examine
the  appropriate  balance  between  number  of  registers  and  number  of  buses.  The  most  re¬ 
lated  ones  are  those  that  investigated  the  performance  impact  of  different  organizations  of 
register  files  in  VLIW  or  superpipelined  scalar  architectures  [105,  15,  37].  These  studies 
compared  the  performances  of  monolithic  and  distributed  register  files,  whereas  I  compared 
the  performances  of  various  configurations  of  a  partitioned  register  file,  an  organization  that 
falls  between  the  other  two  with  respect  to  accessibility  and  hardware  cost.  A  conclusion 
that  could  be  drawn  from  these  studies  is  that,  in  order  to  get  the  best  performance,  the 
number  of  result  buses  must  be  equal  to  the  number  of  register  sets  in  a  distributed  register 
file.  However,  the  amount  of  parallelism  supported  by  the  architectures  in  these  studies 
is  small  in  comparison  to  the  amount  supported  by  the  vector  architecture  I  used.  Once 
enough  parallelism  is  supported  by  the  entire  architecture  of  a  processor,  I  showed  that  the 
number  of  buses  can  be  less  than  the  number  of  register  banks  without  adversely  affecting 
performance. 


6.6  Summary 

In  this  chapter,  I  examined  in  greater  detail  the  tradeoff  between  improved  per¬ 
formance  and  increased  cost  when  implementing  more  vector  registers.  In  a  multi-chip 
implementation,  such  as  the  Cray  Y-MP  processor,  the  number  of  chips  is  a  good  measure 
of  hardware  cost.  Using  this  metric,  I  showed  that  doubling  the  number  of  registers  from  8 
to  16  results  in  a  25%  increase  in  the  number  of  chips  when  implemented  in  a  straight¬ 
forward  fashion,  while  providing  only  a  9%  improvement  in  performance.  The  reason  for 
this  high  cost  is  that  the  implementation  of  a  vector  register  file  actually  consists  of  register 
banks,  which  store  data,  and  interconnections,  which  link  register  banks  to  functional  units. 
Doubling  the  number  of  registers  in  a  straightforward  fashion  requires  doubling  both  these 
types  of  components. 
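
As a rough illustration of why this tradeoff is unattractive, the two percentages above can be folded into a single ratio. The Python sketch below computes relative performance per chip; the function name and the ratio itself are assumptions introduced here for illustration and are not a metric used elsewhere in this dissertation.

```python
def performance_per_cost(perf_gain: float, cost_increase: float) -> float:
    """Relative performance per unit of cost after a design change.

    perf_gain and cost_increase are fractions, e.g. 0.09 for 9%.
    A result below 1.0 means cost grew faster than performance.
    """
    return (1.0 + perf_gain) / (1.0 + cost_increase)


# Doubling from 8 to 16 registers: 9% faster for 25% more chips.
ratio = performance_per_cost(perf_gain=0.09, cost_increase=0.25)
print(f"{ratio:.3f}")  # prints 0.872
```

Under this simple measure, the straightforward 16-register design delivers about 87% of the performance per chip of the 8-register design.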

Because the size of an interconnection is determined by the number of buses
attached to it, an obvious hardware solution that improves the tradeoff between increased
cost and performance gain is to have more than one register share a bus. This new
configuration, which I call a partitioned vector register file, is another example of partitioning
a  register  file  to  reduce  cost  at  the  expense  of  increased  restrictions  on  accessing  registers. 
Just  as  a  traditional  vector  register  file,  which  is  also  partitioned,  is  less  costly  to  implement 
than  a  monolithic  one  with  the  same  number  of  ports,  a  partitioned  vector  register  file  is 
less  costly  to  implement  than  a  traditional  one  with  the  same  number  of  vector  registers. 

Although  the  restricted  access  to  registers  is  not  a  problem  in  a  vector  register  file, 
the  restricted  access  to  vector  registers  in  a  partitioned  vector  register  file  does  present  a 
challenge.  Figure  6.14  presents  an  overview  of  the  algorithm  I  developed  that  circumvents 
the  restricted  access  of  this  new  configuration.  My  algorithm  has  two  goals.  One  goal  is  to 
avoid  access  conflicts,  such  as  WAR  or  WAW  register  dependences,  which  would  degrade 
performance.  This  is  accomplished  by  always  assigning  values  that  are  active  at  the  same 
time to different buses and assigning values that are live at the same time to different registers.

^Section 2.2.2 (pages 11 to 16 in Chapter 2, Fundamentals of Vector Architectures) explains the
organizational differences between monolithic, partitioned, and distributed register files.

[...] showed that assigning values in order of creation [...] I developed [...]

Finally, I presented data to [...] the Y-MP [...] and to choose a partitioned register [...]
performance metric is the number of [...] priorities of performance and cost. [...] improving
performance, such as combining [...] 16 registers is more appropriate and [...] spilling should
be investigated, although further studies are needed to measure the resultant impact on
performance.


Figure 6.14: Algorithm for Assigning Values to a Partitioned Register File

This figure is an overview of the assignment algorithm I developed for using a partitioned
register file where more than one register shares a bus. Details of the individual components are given
in the indicated sections and figures. Figure 4.2 (on page 67 in Chapter 4, Common Experimental
Framework) shows where the functions in the above diagram are performed in the cft77 compiler.

Chapter 7

Concluding  Remarks 


In  this  chapter,  I  summarize  my  work  by  highlighting  the  contributions  in  this 
dissertation.  I  conclude  with  a  discussion  of  studies  for  future  work. 

7.1  Contributions  of  Dissertation 

Each of the four major chapters in this dissertation contains contributions to the
areas of processor design and code optimization. These contributions fall into one of three
categories: improvements to previous work, syntheses of published material, and extensions
to the state of the art.

7.1.1  Improvements  to  Previous  Work 

Chapters 2, 5, and 6 contain improvements to previous work. The first of these
improvements is in Section 2.2 of Chapter 2, Fundamentals of Vector Architectures (pages 6
to 11), where I used a common framework to compare how different architectural classes
support fine-grain parallelism. Jouppi and Wall have done a similar comparison of
architectural classes but focussed mainly on how multiple operations are initiated [70]. My
comparison adds to theirs by including techniques for the simultaneous delivery of operands
and results. Another improvement is the classification of implementations for a multiported
register file in Section 2.2.2 (pages 11 to 16). Other researchers in processor design have
classified register files into shared and split ones, which I call monolithic and distributed,
respectively [105, 15]. I expanded this classification to include a partitioned register file, of
which a vector register file is an example.

A third improvement is in Chapter 5, Register Usage and Instruction Scheduling,
for which I developed a vector scheduling algorithm that is better able to use more vector
registers than the scheduling algorithm used in the version of the cft77 compiler I used
during my work term at Cray Research, Incorporated in 1990. Described in
Figure 5.9 (on page 94), my algorithm is a list scheduling one for
a vector architecture and is similar to Tang and Davidson’s simple vector scheduler [107].
List scheduling algorithms have previously been used for VLIW and scalar architectures
[34, 48]. An algorithm similar to the one I developed has been implemented in a version
of the cft77 compiler more recent than the one I used for my studies [62].
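
For readers unfamiliar with list scheduling, the following Python sketch shows the generic technique on a small dependence graph. It is not the algorithm of Figure 5.9: the function name, the one-issue-per-cycle machine model, the critical-path priority, and the example latencies are all assumptions chosen for illustration, and vector-specific concerns such as chaining and functional-unit conflicts are ignored. The input graph is assumed to be acyclic.

```python
from typing import Dict, List


def list_schedule(succs: Dict[str, List[str]],
                  latency: Dict[str, int]) -> Dict[str, int]:
    """Greedy list scheduling on a DAG, issuing at most one operation per cycle.

    Among the ready operations, the one with the longest latency-weighted
    path to a leaf is issued first (ties broken by name).
    Returns a map from operation name to issue cycle.
    """
    # Priority of an operation: its latency plus the longest path below it.
    prio: Dict[str, int] = {}

    def height(op: str) -> int:
        if op not in prio:
            prio[op] = latency[op] + max((height(s) for s in succs[op]), default=0)
        return prio[op]

    for op in succs:
        height(op)

    # Invert the successor lists to obtain predecessors.
    preds: Dict[str, List[str]] = {op: [] for op in succs}
    for op, ss in succs.items():
        for s in ss:
            preds[s].append(op)

    issue: Dict[str, int] = {}
    cycle = 0
    while len(issue) < len(succs):
        # An operation is ready once every predecessor has produced its result.
        ready = [op for op in succs
                 if op not in issue
                 and all(p in issue and issue[p] + latency[p] <= cycle
                         for p in preds[op])]
        if ready:
            best = max(ready, key=lambda op: (prio[op], op))
            issue[best] = cycle
        cycle += 1
    return issue


# Hypothetical loop body: (load_a * load_b) + load_c
graph = {"load_a": ["mul"], "load_b": ["mul"], "load_c": ["add"],
         "mul": ["add"], "add": []}
lat = {"load_a": 3, "load_b": 3, "load_c": 3, "mul": 2, "add": 1}
schedule = list_schedule(graph, lat)
```

In this example the two loads that feed the multiply are issued before `load_c` because they lie on the longer path, which is the behavior that distinguishes a priority-driven list scheduler from one that simply follows program order.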

The  fourth  and  final  improvement  is  in  Chapter  6,  Bus  Usage  and  Register  As¬ 
signment,  for  which  I  developed  an  optimal  algorithm  that  assigns  values  to  the  minimum 
number of registers for a given execution order of a dependence graph. Although other
optimal algorithms have been published for this problem, the proof of their optimality relies
on the use of a stack and does not allow any choice for a register [44, 51, 27]. In contrast,
I based the proof for my algorithm, which is presented in Section 6.2.3 (pages 123 to 130),
on  a  necessary  but  not  sufficient  condition  for  optimality  that  is  completely  independent  of 
register  choice.  As  a  result,  my  algorithm  emphasizes  the  importance  of  assigning  values 
in  the  order  of  their  creation  times  and  does  not  specify  which  register  to  choose  when 
there  are  several  candidates,  a  choice  that  can  be  left  as  an  implementation  detail  without 
affecting  the  optimality  of  my  algorithm. 
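
The idea of assigning values in order of creation time can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm or proof of Section 6.2.3: the function name, the representation of a value as a (creation time, last-use time) pair, and the rule that a register is reusable only strictly after its value's last use are all hypothetical choices. Which free register is picked is deliberately arbitrary.

```python
from typing import Dict, Tuple


def assign_registers(lifetimes: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Assign each value a register, visiting values in order of creation time.

    lifetimes maps a value name to (creation_time, last_use_time) for one
    fixed execution order. A register is treated as free for a new value
    only if the value it holds was last used strictly before the new
    value is created (an assumption of this sketch).
    """
    assignment: Dict[str, int] = {}
    busy: Dict[int, int] = {}  # register number -> last-use time of its value
    next_new = 0
    for name, (created, last_use) in sorted(lifetimes.items(),
                                            key=lambda kv: kv[1][0]):
        free = [r for r, end in busy.items() if end < created]
        if free:
            reg = free[0]  # any free register would do equally well
        else:
            reg = next_new  # no free register: open a new one
            next_new += 1
        busy[reg] = last_use
        assignment[name] = reg
    return assignment


# Four hypothetical values; at most two are live at any time.
regs = assign_registers({"a": (0, 3), "b": (1, 2), "c": (3, 5), "d": (4, 6)})
```

A new register is opened only when every register already in use holds a live value, so the number of registers used equals the largest number of simultaneously live values, and that holds regardless of which free register is chosen.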

7.1.2  Syntheses  of  Published  Material 

The  next  set  of  contributions,  which  appear  in  Chapters  2  and  3,  provide  new 
observations  of  already  published  material.  The  first  contribution  that  synthesizes  known 
material  concerns  stripmining,  which  is  the  classic  technique  for  executing  long  loops  with 
vector  instructions,  and  loop  unrolling,  which  is  a  standard  compiler  optimization  for  exe¬ 
cuting  a  loop  with  scalar  instructions.  In  Section  2.3.2  of  Chapter  2,  Fundamentals  of  Vector 
Architectures  (pages  24  to  29),  I  showed  that  using  vector  instructions  in  a  stripmined  loop 
is,  in  fact,  a  compact  form  of  loop  unrolling,  an  observation  that  was  made  parenthetically 
by  Jouppi  and  Wall  [70]. 
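
The correspondence between stripmining and loop unrolling can be illustrated with a small Python sketch. The names `VL`, `vadd`, `stripmined_add`, and `scalar_add` are hypothetical; `vadd` merely stands in for a single vector add instruction, and the vector length of 64 elements matches the register length used throughout this dissertation.

```python
from typing import List

VL = 64  # elements per vector register


def vadd(x: List[float], y: List[float]) -> List[float]:
    """Stand-in for one vector add instruction on at most VL elements."""
    assert len(x) == len(y) <= VL
    return [a + b for a, b in zip(x, y)]


def stripmined_add(x: List[float], y: List[float]) -> List[float]:
    """Execute a long loop as a sequence of strips of at most VL elements."""
    out: List[float] = []
    for start in range(0, len(x), VL):
        # One vector instruction replaces up to VL scalar iterations.
        out.extend(vadd(x[start:start + VL], y[start:start + VL]))
    return out


def scalar_add(x: List[float], y: List[float]) -> List[float]:
    """The original scalar loop, one element per iteration."""
    return [x[i] + y[i] for i in range(len(x))]


n = 150
x = [float(i) for i in range(n)]
y = [2.0 * i for i in range(n)]
strips = len(range(0, n, VL))  # number of strip-loop iterations
```

For 150 elements the strip loop runs 3 times instead of 150, much as if the scalar loop had been unrolled 64 times, but each unrolled body is encoded as one vector instruction rather than 64 copies of the scalar code.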

Another  contribution  from  this  chapter  is  the  presentation  of  the  properties  of  a 
vectorizable  program  fragment.  Because  it  is  the  responsibility  of  a  compiler  to  identify  such 
parts,  most  descriptions  of  a  vectorizable  program  fragment  are  given  from  the  perspective 
of  a  compiler.  However,  such  presentations  also  include  what  cannot  be  vectorized  because 
of  inadequate  compilation  technology.  In  contrast,  to  illustrate  the  restrictions  imposed 
by  vector  hardware,  I  derive  the  properties  of  a  vectorizable  program  fragment  based  on 
characteristics  of  vector  hardware  (in  Section  2.3.1,  pages  17  to  24). 

The  third  and  largest  contribution  is  the  synthesis  of  observations  and  data  in 
Chapter  3,  A  Case  for  Vector  Architectures.  Although  some  of  these  have  been  published 
before  and  others  are  obvious,  I  transform  these  individual  items  into  arguments  that  to¬ 
gether  advocate  the  implementation  of  vector  architectures  over  superscalar  ones  in  CMOS 
VLSI  technology.  Following  is  a  list  of  the  more  convincing  arguments: 

•  In  Section  3.1.3  (pages  38  to  44),  I  showed  how  the  partitioning  of  a  vector  register 
file  provides  8  times  as  many  registers  but  requires  only  1.25  times  as  much  area  as  a 
monolithic  register  file  with  64  registers  and  comparable  bandwidth. 

•  In  Section  3.2.1  (pages  46  to  51),  I  used  data  from  Wall’s  parallelism  study  to  show 
that  vectorizable  program  fragments  are  rich  in  parallelism  and  are,  furthermore,  most 
likely  to  be  the  more  time-consuming  programs  in  a  workload  [115]. 

•  In the subsequent section, Section 3.2.2 (pages 51 to 54), to demonstrate the
effectiveness of this type of parallelism, I explained how 25 times as many instructions could
be executed if the hardware were to make full use of the intrinsic parallelism in the
workload. The use of parallelism to increase the size of a workload rather than reduce
execution time was first documented by Gustafson [54].

•  Finally, because it would be unwise to completely ignore the effects of Amdahl’s Law,
I presented data in Section 3.2.3 (pages 54 to 58) showing that a superpipelined
architecture, such as the Cray Y-MP scalar processor, can take advantage of the
parallelism in non-vectorizable program fragments. Moreover, an analysis
of data published by Weiss and Smith [119] shows that, for vectorizable program
fragments, vector hardware in combination with superpipelined hardware provides
significant improvement in performance over superpipelined hardware alone. Jouppi
and Wall have also argued for the use of superpipelined hardware, albeit without any
vector hardware, to support fine-grain parallelism [70, 69].

7.1.3  Extensions  to  the  State  of  the  Art 

The last and most significant set of contributions appear in Chapters 5 and 6.
These contributions extend the state of the art with a compiler algorithm for a new register
organization and with empirical data that strengthen qualitative descriptions. The first of these
contributions is in Chapter 6, Bus Usage and Register Assignment, for which I developed
an assignment algorithm for a vector register file where more than one register shares a
bus. In Section 6.2 (pages 116 to 130), I modeled the problem of locally assigning values
to a register file as the problem of coloring two graphs, thus building upon previous work
that uses graph coloring to model the problem of globally assigning registers [35, 36, 18].
Based on definitions from graph theory, I also demonstrated [...]
for graphs reveal any special structure that these graphs may have [82, 49]. In addition, I
extended my algorithm to allow assignments of the form Vi <- Vi op Vj and presented data
showing that the reuse of registers in such an assignment has minimal impact on overall
register usage (Section 6.4.2, pages 133 to 139).
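
To give a feel for the two constraints involved, the Python sketch below places values into (bus, register) pairs greedily. It is a simplified illustration and not the algorithm of Figure 6.14: the function name, the representation of a value, and in particular the definition of "active" as the set of cycles in which a value is written or read are assumptions made here. The sketch returns `None` when the greedy search finds no placement.

```python
from typing import Dict, Optional, Set, Tuple

# (creation time, last-use time, cycles in which the value is accessed)
Value = Tuple[int, int, Set[int]]


def assign_partitioned(values: Dict[str, Value], buses: int,
                       regs_per_bus: int) -> Optional[Dict[str, Tuple[int, int]]]:
    """Greedily place each value in a (bus, register-on-that-bus) pair.

    Two rules are enforced:
      * values accessed in the same cycle must be on different buses;
      * values with overlapping lifetimes must be in different registers.
    """
    placed: Dict[str, Tuple[int, int]] = {}
    for name, (created, last_use, active) in sorted(values.items(),
                                                    key=lambda kv: kv[1][0]):
        choice = None
        for bus in range(buses):
            # Bus rule: skip a bus already used by a simultaneously active value.
            if any(placed[o][0] == bus and values[o][2] & active for o in placed):
                continue
            for slot in range(regs_per_bus):
                # Register rule: skip a register holding a still-live value.
                clash = any(placed[o] == (bus, slot)
                            and values[o][0] <= last_use
                            and created <= values[o][1]
                            for o in placed)
                if not clash:
                    choice = (bus, slot)
                    break
            if choice is not None:
                break
        if choice is None:
            return None  # the greedy search found no legal placement
        placed[name] = choice
    return placed


# Hypothetical values: a and c are both accessed in cycle 4, b and d in cycle 5.
vals = {"a": (0, 4, {0, 4}), "b": (1, 5, {1, 5}),
        "c": (2, 6, {2, 4}), "d": (5, 8, {5, 8})}
placement = assign_partitioned(vals, buses=2, regs_per_bus=2)
```

Here `a` and `b` can share bus 0 because they are never accessed in the same cycle, while `c` is pushed to bus 1 by its conflict with `a`. The two rules correspond to coloring two different graphs over the same set of values, one for buses and one for registers.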

In addition to a new algorithm, I also contributed to the state of the art by carrying
out experiments that validated hypotheses concerning the effectiveness of the compiler
and the register organizations I studied. The infrastructure for
these experiments, which was provided by Cray Research, Incorporated and is described in
Chapter 4, Common Experimental Framework, consists of the following three items:

1.  a  development  version  of  a  production  vectorizing  compiler, 

2.  a  simulator  that  models  the  Y-MP  vector  processor,  an  architecture  which  has  fully 
flexible  chaining  capabilities,  and 

3.  a  set  of  36  vectorizable  loops  that  are  extracted  from  actual  applications. 

Both  Chapters  5  and  6  contain  contributions  in  the  form  of  empirical  data. 

In Section 5.1 of Chapter 5, Register Usage and Instruction Scheduling (pages 72
to 83), I hypothesized that both more than 8 vector registers and a scheduling algorithm
different from the one used in the 1990 version of the cft77 compiler are needed to improve
performance. Although two research groups have also hypothesized the need for more
registers and a third has carried out experiments to determine how many, their research
centers around using polycyclic vector scheduling for vector architectures without chaining
capabilities [108, 32, 33, 81]. The empirical data I presented in Section 5.4 (pages 95 to 104)
not  only  validated  my  hypotheses  but  also  sharpened  the  qualitative  descriptions  I  presented 
by  indicating  how  many  more  registers  are  needed  to  improve  performance  by  how  much 
and  how  frequently. 

In Section 6.1 of Chapter 6, Bus Usage and Register Assignment (pages 111 to 116),
I observed that only a subset of simultaneously live values are actually used at any given
time and hypothesized that partitioning a vector register file would reduce the cost of
implementing one with minimal loss in performance. The data I presented in Section 6.4
(pages 131 to 140) not only indicates the effectiveness of my assignment algorithm but
also  shows  that  the  subset  of  simultaneously  live  values  is  small  enough  for  the  majority 
of  loops  in  the  CRI  workload  to  effectively  use  a  partitioned  vector  register  file.  Moreover, 
this  data  indicates  that  my  hypothesis  is  true  once  enough  partitions  are  provided,  thus 
providing  quantitative  evidence  that  partitioning  is  a  cost-effective  method  for  improving 
performance.  I  do  not  know  of  any  other  published  work  that  provides  data  demonstrating 
the  effectiveness  of  a  partitioned  vector  register  file. 


7.2  Future  Studies 

Although  my  investigations  varied  the  number  of  registers  and  buses  in  a  vector 
register  file,  the  number  of  elements  per  vector  register  remained  constant  at  64.  More 
simulation  studies  are  needed  to  determine  the  effect  of  longer  and  shorter  vector  lengths 
on  performance.  One  experiment  could  verify  the  hypothesis  that  increasing  the  number  of 
vector  registers  improves  performance  more  than  does  increasing  the  number  of  elements 
per  vector  register.  Another  experiment  could  determine  if  performance  does  not  decline 
significantly  when  the  vector  length  is  shortened  to  32  elements  per  vector  register  in  order  to 
demonstrate  that  a  register  organization  with  8  buses  and  32  registers  provides  an  excellent 
tradeoff  between  increased  cost  and  improved  performance. 

The assignment algorithm I presented did not include any contingency for when
the number of registers or buses assigned exceeds what is provided in hardware. A study
for  the  future  is  to  modify  the  algorithm  to  handle  this  case  and  evaluate  the  impact  on 
performance  of  spilling  registers  in  a  partitioned  vector  register  file.  Of  particular  interest 
to  such  a  future  study  is  a  register  organization  with  8  buses  and  16  registers,  the  one  for 
which  I  was  unable  to  produce  strong  performance  data  in  Section  6.4  of  Chapter  6,  Bus 
Usage  and  Register  Assignment  (pages  131  to  140). 

Although  my  algorithm  for  optimally  assigning  registers  does  not  directly  address 
the  problem  of  spilling  registers,  it  can  be  easily  extended  to  do  so.  What  is  unclear 
is  whether  it  would  remain  optimal.  Horwitz  et  al.  as  well  as  Prabhala  and  Sethi  have 
developed algorithms that spill registers using the minimal number of memory references
for a given number of registers in hardware [60, 95]. These algorithms, however, apply to
index registers and stack registers, respectively. For general purpose registers, Hsu, Fischer,
and Goodman have presented an optimal algorithm for register spilling that is an extension
of the algorithm by Horwitz et al. [63]. Because the algorithm is based on enumeration,
however, it can become computationally impractical for large [...]
these algorithms model the problem of spilling registers [...]
of theoretical interest is to determine whether my optimal algorithm for assigning
registers can be extended to spill registers using the minimal number of memory references
for a given number of registers in hardware.

[...] a list scheduling algorithm outperforms the one used by the 1990 version of
the cft77 compiler [...]; however, the reverse is true for a few dependence graphs, two of
which are shown in Figures 5.12 and 5.13 (on pages 98 and 99). An undertaking for the
future is to [...] algorithm that performs as well as the list scheduler but never [...]
compiler. One possibility is an algorithm that schedules operations in an order that is
the reverse of the order used by either the cft77 or list scheduler and that uses a
placement strategy similar to the cft77’s one. Although such an algorithm produces the same
execution order as does the cft77 scheduler for the dependence graphs shown in
Figures 5.12 and 5.13, more simulation studies are needed to determine [...]
well or better for the rest of the CRI workload. Because both the order and the strategy
differ in the list and cft77 schedulers, another study for the future is to determine which
- order or strategy - has more impact on performance. Because there are dependence
graphs for which [...] makes a difference, a potential line of investigation is to analyze [...]

[...] explored various register organizations but
always used the configuration of functional units from the Cray Y-MP. Another project
for the future is to compare the performance of different configurations of functional units,
varying both the number and types. An extension to this proposed study and those in
this dissertation is to determine the appropriate numbers of [...]
to use different configurations of functional units effectively in an attempt to establish a
rule-of-thumb that would provide this ratio analytically. Such a study would [...]
hypothesis that a partitioned vector register file with 4 buses is an inadequate design for [...]
of functional units in the Ardent Titan and Fujitsu VPU. [...] binary
compatibility is not affected by a change of functional units, an interesting value to quantify
is the impact on performance when using an execution order scheduled for a configuration
different [...] fully flexible chaining. A [...] suggestion
for future work is to combine this [...]

For cost reasons, it may be desirable
to reduce the level of chaining. One example is to prevent vector memory [...]
from chaining with non-memory ones in order to simplify the memory system. Although
less chaining provides less opportunity for parallelism to occur, sophisticated scheduling
algorithms, such as polycyclic vector scheduling, can be used to increase the amount of
parallelism but at the expense of more registers. Interesting values to compare are the numbers
of registers and buses needed to compensate for the lack of chaining and the numbers needed
when there is chaining.